From 4521bb65f3f1c0dc8fed96cbcac0388210063694 Mon Sep 17 00:00:00 2001 From: Helena Kloosterman Date: Tue, 28 Jan 2025 15:35:45 +0100 Subject: [PATCH 01/15] do_sample=False for NPU in chat_sample, add NPU to README (#1637) - make chat_sample work out of the box on NPU by forcing do_sample=False for NPU - add NPU info to text_generation samples README and a small unrelated change: - change `pip install` command for exporting models that are already on huggingface-hub. No need to install all of PyTorch and transformers if you only need to download a model. --- samples/cpp/text_generation/README.md | 13 ++++++++++++- samples/python/text_generation/README.md | 13 ++++++++++++- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/samples/cpp/text_generation/README.md b/samples/cpp/text_generation/README.md index f370c74a80..dd24b6ebf5 100644 --- a/samples/cpp/text_generation/README.md +++ b/samples/cpp/text_generation/README.md @@ -19,7 +19,7 @@ optimim-cli export openvino --model ``` If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli. ```sh -pip install --upgrade-strategy eager -r ../../export-requirements.txt +pip install huggingface-hub huggingface-cli download --local-dir ``` @@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", ``` +#### NPU support + +NPU device is supported with some limitations. See [NPU inference of +LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular: + +- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model `). + For models with more than 4B parameters, channel wise quantization should be used (`--group-size -1`). +- Beam search and parallel sampling are not supported. +- Use OpenVINO 2025.0 or later (installed by deployment-requirements.txt, see "Common information" section), and the latest NPU driver. + + ### 2. Greedy Causal LM (`greedy_causal_lm`) - **Description:** Basic text generation using a causal language model. diff --git a/samples/python/text_generation/README.md b/samples/python/text_generation/README.md index 84b5302639..97a6ad59bc 100644 --- a/samples/python/text_generation/README.md +++ b/samples/python/text_generation/README.md @@ -19,7 +19,7 @@ optimim-cli export openvino --model ``` If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli. 
```sh -pip install --upgrade-strategy eager -r ../../export-requirements.txt +pip install huggingface-hub huggingface-cli download --local-dir ``` @@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", ``` +#### NPU support + +NPU device is supported with some limitations. See [NPU inference of +LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular: + +- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model `). + For models with more than 4B parameters, channel wise quantization should be used (`--group-size -1`). +- Beam search and parallel sampling are not supported. +- Use OpenVINO 2025.0 or later (installed by deployment-requirements.txt, see "Common information" section), and the latest NPU driver. + + ### 2. Greedy Causal LM (`greedy_causal_lm`) - **Description:** Basic text generation using a causal language model. From 4fb48deed27d23cbb517eb1247e8f1f48bdc8596 Mon Sep 17 00:00:00 2001 From: Vishniakov Nikolai Date: Tue, 28 Jan 2025 16:08:48 +0100 Subject: [PATCH 02/15] [JS] Add GenAI Node.js bindings (#1193) Adding Node.js bindings for GenAI pipelines. - 155187 - 158132 ## Limitations Current version it's primary backbone of future development. Supports bindings of `LLMPipeline` only. ## TODO - [x] Test build configuration - [x] Integrate unit tests run into GHA - [ ] Add script to download runtime binaries (after binaries publication) --------- Co-authored-by: Vladimir Zlobin --- .github/workflows/linux.yml | 148 ++++-- CMakeLists.txt | 5 +- cmake/features.cmake | 7 + pyproject.toml | 2 +- samples/js/text_generation/.gitignore | 1 + samples/js/text_generation/README.md | 48 ++ samples/js/text_generation/chat_sample.js | 54 ++ samples/js/text_generation/package-lock.json | 42 ++ samples/js/text_generation/package.json | 15 + .../js/text_generation/tests/usage.test.js | 62 +++ src/CMakeLists.txt | 4 + src/cpp/CMakeLists.txt | 35 +- src/js/.gitignore | 6 + src/js/.npmignore | 15 + src/js/CMakeLists.txt | 93 ++++ src/js/README.md | 56 +++ src/js/include/addon.hpp | 20 + src/js/include/helper.hpp | 23 + .../llm_pipeline/finish_chat_worker.hpp | 18 + src/js/include/llm_pipeline/init_worker.hpp | 21 + .../llm_pipeline/llm_pipeline_wrapper.hpp | 27 + .../llm_pipeline/start_chat_worker.hpp | 18 + src/js/lib/bindings.cjs | 1 + src/js/lib/module.js | 141 ++++++ src/js/package-lock.json | 470 ++++++++++++++++++ src/js/package.json | 30 ++ src/js/src/addon.cpp | 30 ++ src/js/src/helper.cpp | 53 ++ .../src/llm_pipeline/finish_chat_worker.cpp | 14 + src/js/src/llm_pipeline/init_worker.cpp | 18 + .../src/llm_pipeline/llm_pipeline_wrapper.cpp | 153 ++++++ src/js/src/llm_pipeline/start_chat_worker.cpp | 14 + src/js/tests/bindings.test.js | 58 +++ src/js/tests/models.js | 3 + src/js/tests/module.test.js | 142 ++++++ src/js/tests/setup.js | 6 + src/js/tests/utils.js | 47 ++ src/js/thirdparty/node-lib.def | 147 ++++++ src/js/thirdparty/win_delay_load_hook.cc | 52 ++ 39 files changed, 2062 insertions(+), 37 deletions(-) create mode 100644 samples/js/text_generation/.gitignore create mode 100644 samples/js/text_generation/README.md create mode 100644 
samples/js/text_generation/chat_sample.js create mode 100644 samples/js/text_generation/package-lock.json create mode 100644 samples/js/text_generation/package.json create mode 100644 samples/js/text_generation/tests/usage.test.js create mode 100644 src/js/.gitignore create mode 100644 src/js/.npmignore create mode 100644 src/js/CMakeLists.txt create mode 100644 src/js/README.md create mode 100644 src/js/include/addon.hpp create mode 100644 src/js/include/helper.hpp create mode 100644 src/js/include/llm_pipeline/finish_chat_worker.hpp create mode 100644 src/js/include/llm_pipeline/init_worker.hpp create mode 100644 src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp create mode 100644 src/js/include/llm_pipeline/start_chat_worker.hpp create mode 100644 src/js/lib/bindings.cjs create mode 100644 src/js/lib/module.js create mode 100644 src/js/package-lock.json create mode 100644 src/js/package.json create mode 100644 src/js/src/addon.cpp create mode 100644 src/js/src/helper.cpp create mode 100644 src/js/src/llm_pipeline/finish_chat_worker.cpp create mode 100644 src/js/src/llm_pipeline/init_worker.cpp create mode 100644 src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp create mode 100644 src/js/src/llm_pipeline/start_chat_worker.cpp create mode 100644 src/js/tests/bindings.test.js create mode 100644 src/js/tests/models.js create mode 100644 src/js/tests/module.test.js create mode 100644 src/js/tests/setup.js create mode 100644 src/js/tests/utils.js create mode 100644 src/js/thirdparty/node-lib.def create mode 100644 src/js/thirdparty/win_delay_load_hook.cc diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml index 98ac356e11..27b8355ce6 100644 --- a/.github/workflows/linux.yml +++ b/.github/workflows/linux.yml @@ -42,7 +42,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: 'openvinogithubactions.azurecr.io/openvino_provider:0.1.0' - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -114,11 +114,11 @@ jobs: cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ${{ env.SRC_DIR}} -B ${{ env.BUILD_DIR }} cmake --build ${{ env.BUILD_DIR}} --config ${{ matrix.build-type }} --parallel $(nproc) --verbose cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --prefix ${{ env.INSTALL_DIR }} - + - name: Pack Artifacts run: tar -cvf - * | pigz > ${{ env.BUILD_DIR }}/${{ env.GENAI_ARCHIVE_NAME }} working-directory: ${{ env.INSTALL_DIR }} - + - name: Upload Archive Distribution Package if: ${{ always() }} uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 @@ -137,7 +137,7 @@ jobs: runs-on: aks-linux-4-cores-16gb container: image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING @@ -161,7 +161,7 @@ jobs: name: ${{ needs.openvino_download.outputs.ov_artifact_name }} path: ${{ env.OV_INSTALL_DIR }} merge-multiple: true - + - name: Build Tokenizers Wheel run: | python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} \ @@ -169,7 +169,7 @@ jobs: ${{ needs.openvino_download.outputs.ov_wheel_source }} \ ${{ env.SRC_DIR }}/thirdparty/openvino_tokenizers working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build GenAI Wheel run: | python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} \ @@ -177,11 +177,11 @@ jobs: ${{ 
needs.openvino_download.outputs.ov_wheel_source }} \ ${{ env.SRC_DIR }} working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build WWB Wheel run: python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} ${{ env.SRC_DIR }}/tools/who_what_benchmark working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Upload Wheels if: ${{ always() }} uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 @@ -189,7 +189,7 @@ jobs: name: genai_wheels path: ${{ env.INSTALL_DIR }} if-no-files-found: 'error' - + genai_build_samples: name: Build Samples - ${{ matrix.build-type }} strategy: @@ -204,7 +204,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING @@ -228,17 +228,17 @@ jobs: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_archive_${{ matrix.build-type }}}" path: ${{ env.OV_INSTALL_DIR }} merge-multiple: true - + - name: Extract Artifacts run: pigz -dc ${{ env.GENAI_ARCHIVE_NAME }} | tar -xf - -C ${{ env.OV_INSTALL_DIR }} working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build Samples (Release) if: ${{ 'Release' == matrix.build-type }} run: | chmod +x ${{ env.OV_INSTALL_DIR }}/samples/cpp/build_samples.sh ${{ env.OV_INSTALL_DIR }}/samples/cpp/build_samples.sh -i ${{ env.INSTALL_DIR }} - + - name: Build Samples (${{ matrix.build-type }}) if: ${{ 'Release' != matrix.build-type }} run: | @@ -246,7 +246,7 @@ jobs: cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ${{ env.OV_INSTALL_DIR }}/samples/cpp/ -B ${{ env.BUILD_DIR }} cmake --build ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --parallel $(nproc) cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --component samples_bin --prefix ${{ env.INSTALL_DIR }} - + - name: Pack Artifacts run: tar -cvf - * | pigz > ${{ env.INSTALL_DIR }}/${{ env.GENAI_SAMPLES_NAME }} working-directory: ${{ env.INSTALL_DIR }} @@ -258,7 +258,7 @@ jobs: name: genai_samples_${{ matrix.build-type }} path: ${{ env.INSTALL_DIR }}/*.tar.gz if-no-files-found: 'error' - + genai_tests_wheel: name: Python (${{ matrix.test.name}}) Tests (wheel) needs: [ openvino_download, genai_build_wheel ] @@ -279,7 +279,7 @@ jobs: runs-on: aks-linux-4-cores-16gb container: image: openvinogithubactions.azurecr.io/ov_test/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -289,39 +289,39 @@ jobs: BUILD_DIR: ${{ github.workspace }}/build TRANSFORMERS_CACHE: ${{ github.workspace }}/models # Hugging Face transformers cache HF_HOME: ${{ github.workspace }}/datasets # Hugging Face datasets cache - + steps: - name: Clone openvino.genai uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: path: ${{ env.SRC_DIR }} submodules: recursive - + - name: Download Build Artifacts uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 with: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_wheels}" path: ${{ env.INSTALL_DIR }} merge-multiple: true - + - name: Install GenAI Wheels uses: ./src/.github/actions/install_wheel with: packages: "openvino;openvino_tokenizers[transformers];openvino_genai;whowhatbench" requirements_files: "${{ env.SRC_DIR 
}}/tests/python_tests/requirements.txt" local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels - + - name: Tests run: python -m pytest -v ./${{ matrix.test.cmd }} working-directory: ${{ env.SRC_DIR }} - + genai_samples_tests: name: Samples Tests - ${{ matrix.build-type }} strategy: fail-fast: false matrix: build-type: [Release] - needs: [ openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples ] + needs: [ openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples ] timeout-minutes: 45 defaults: run: @@ -329,7 +329,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: openvinogithubactions.azurecr.io/ov_test/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -338,41 +338,41 @@ jobs: SRC_DIR: ${{ github.workspace }}/src BUILD_DIR: ${{ github.workspace }}/build MODELS_DIR: ${{ github.workspace }}/models - + steps: - name: Clone openvino.genai uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: path: ${{ env.SRC_DIR }} submodules: recursive - + - name: Download Build Artifacts uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 with: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_archive_${{ matrix.build-type }},genai_samples_${{ matrix.build-type }},genai_wheels}" path: ${{ env.INSTALL_DIR }} merge-multiple: true - + - name: Extract Artifacts run: | pigz -dc ${{ env.GENAI_ARCHIVE_NAME }} | tar -xf - -C ${{ env.INSTALL_DIR }} pigz -dc ${{ env.GENAI_SAMPLES_NAME }} | tar -xf - -C ${{ env.INSTALL_DIR }} working-directory: ${{ env.INSTALL_DIR }} - + - name: Install Wheels uses: ./src/.github/actions/install_wheel with: packages: "openvino;openvino_tokenizers[transformers];openvino_genai" requirements_files: "${{ env.SRC_DIR }}/samples/requirements.txt" local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels - + - name: Download & convert Models and data run: | mkdir -p ${{ env.MODELS_DIR }} optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 ${{ env.MODELS_DIR }}/TinyLlama-1.1B-Chat-v1.0 optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny ${{ env.MODELS_DIR }}/whisper-tiny wget https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav -O ${{ env.MODELS_DIR }}/how_are_you_doing_today.wav - + - name: Test multinomial_causal_lm.py if: ${{ 'Release' == matrix.build-type }} # Python bindings can be built in Release only timeout-minutes: 1 @@ -384,10 +384,10 @@ jobs: timeout-minutes: 1 run: ${{ env.INSTALL_DIR }}/samples/python/whisper_speech_recognition/whisper_speech_recognition.py ./whisper-tiny/ how_are_you_doing_today.wav working-directory: ${{ env.MODELS_DIR }} - + - name: C++ Tests Prerequisites run: python -m pip uninstall openvino openvino-tokenizers openvino-genai -y - + - name: Test greedy_causal_lm run: | source ${{ env.INSTALL_DIR }}/setupvars.sh @@ -400,9 +400,93 @@ jobs: ${{ env.INSTALL_DIR }}/samples_bin/whisper_speech_recognition ./whisper-tiny/ how_are_you_doing_today.wav working-directory: ${{ env.MODELS_DIR }} + genai_build_nodejs_bindings: + name: Build Node.js bindings + strategy: + fail-fast: false + matrix: + build-type: [Release] + needs: [ openvino_download ] + timeout-minutes: 20 + defaults: + run: + shell: bash + runs-on: aks-linux-4-cores-16gb + container: + image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ 
needs.openvino_download.outputs.docker_tag }} + volumes: + - /mount:/mount + options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING -v ${{ github.workspace }}:${{ github.workspace }} + env: + CMAKE_GENERATOR: Unix Makefiles + OV_INSTALL_DIR: ${{ github.workspace }}/ov + BUILD_DIR: ${{ github.workspace }}/build + SRC_DIR: ${{ github.workspace }}/src + + steps: + - name: Clone openvino.genai + uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + with: + path: ${{ env.SRC_DIR }} + submodules: recursive + + - name: Download OpenVINO package + uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 + with: + name: ${{ needs.openvino_download.outputs.ov_artifact_name }} + path: ${{ env.OV_INSTALL_DIR }} + merge-multiple: true + + - name: Build with ENABLE_JS=ON + run: | + source ${{ env.OV_INSTALL_DIR }}/setupvars.sh + cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -DENABLE_JS=ON -S ${{ env.SRC_DIR }} -B ${{ env.BUILD_DIR }} + cmake --build ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --parallel $(nproc) --verbose + cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --prefix ${{ env.OV_INSTALL_DIR }} + + - name: Combine binaries for Node.js package + run: | + mkdir -p nodejs + cp -r runtime/lib/intel64/* nodejs + cp -r runtime/3rdparty/tbb/lib/* nodejs + cp genai_node_addon.node nodejs + GENAI_VERSION=$(grep -oP '(?<=CMAKE_PROJECT_VERSION:STATIC=)[^"]*' ${{ env.BUILD_DIR }}/CMakeCache.txt) + OV_VERSION=$(echo $GENAI_VERSION | sed 's/..$//') + patchelf --set-rpath '$ORIGIN' nodejs/libopenvino.so.$OV_VERSION nodejs/libopenvino_genai.so.$GENAI_VERSION + working-directory: ${{ env.OV_INSTALL_DIR }} + + - name: Pack Node.js bindings libs + run: tar -cvf - * | pigz > ${{ env.BUILD_DIR }}/genai_nodejs_bindings.tar.gz + working-directory: ${{ env.OV_INSTALL_DIR }}/nodejs + + - name: Upload Archive Package with Node.js bindings + if: ${{ always() }} + uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 + with: + name: genai_nodejs_bindings + path: ${{ env.BUILD_DIR }}/genai_nodejs_bindings.tar.gz + if-no-files-found: 'error' + + - name: Run npm package tests + working-directory: ${{ env.SRC_DIR }}/src/js + run: | + cp -R ${{ env.OV_INSTALL_DIR }}/nodejs bin + npm install + npm test + + - name: Install genai-node samples dependencies + run: npm install + working-directory: ${{ env.SRC_DIR }}/samples/js/text_generation + + - name: Run samples tests + run: npm test + env: + MODEL_PATH: ${{ env.SRC_DIR }}/src/js/tests/models/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov + working-directory: ${{ env.SRC_DIR }}/samples/js/text_generation + Overall_Status: name: ci/gha_overall_status_linux - needs: [openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples, genai_tests_wheel, genai_samples_tests] + needs: [openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples, genai_tests_wheel, genai_samples_tests, genai_build_nodejs_bindings] if: ${{ always() }} runs-on: ubuntu-latest steps: diff --git a/CMakeLists.txt b/CMakeLists.txt index bb19676da3..ee1cb70f7a 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -88,7 +88,7 @@ endif() add_subdirectory(thirdparty) add_subdirectory(src) -if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/samples") +if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/samples" AND ENABLE_SAMPLES) add_subdirectory(samples) endif() if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/tools/continuous_batching") @@ -109,6 +109,9 @@ set(CPACK_COMPONENTS_ALL core_genai 
core_genai_dev cpp_samples_genai licensing_g if(ENABLE_PYTHON) list(APPEND CPACK_COMPONENTS_ALL pygenai_${Python3_VERSION_MAJOR}_${Python3_VERSION_MINOR}) endif() +if(ENABLE_JS) + list(APPEND CPACK_COMPONENTS_ALL genai_node_addon) +endif() if(WIN32 AND NOT DEFINED CPACK_GENERATOR) set(CPACK_GENERATOR "ZIP") endif() diff --git a/cmake/features.cmake b/cmake/features.cmake index 8b2e05472b..3e494e7355 100644 --- a/cmake/features.cmake +++ b/cmake/features.cmake @@ -3,3 +3,10 @@ # option(ENABLE_PYTHON "Enable Python API build" ON) +option(ENABLE_JS "Enable JS API build" OFF) +option(ENABLE_SAMPLES "Enable samples build" ON) + +# Disable building samples for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + set(ENABLE_SAMPLES OFF) +endif() diff --git a/pyproject.toml b/pyproject.toml index c87ae38253..b54face916 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -47,7 +47,7 @@ find_python3 = true build_args = ["--parallel", "--target", "py_openvino_genai_stub"] install_args = ["--strip"] install_components = ["wheel_genai"] -options = {"BUILD_TOKENIZERS" = "OFF"} +options = {"BUILD_TOKENIZERS" = "OFF", "ENABLE_SAMPLES" = "OFF"} [build-system] requires = [ diff --git a/samples/js/text_generation/.gitignore b/samples/js/text_generation/.gitignore new file mode 100644 index 0000000000..3c3629e647 --- /dev/null +++ b/samples/js/text_generation/.gitignore @@ -0,0 +1 @@ +node_modules diff --git a/samples/js/text_generation/README.md b/samples/js/text_generation/README.md new file mode 100644 index 0000000000..46caba48e3 --- /dev/null +++ b/samples/js/text_generation/README.md @@ -0,0 +1,48 @@ +# JavaScript chat_sample that supports most popular models like LLaMA 3 + +This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `Pipeline.LLMPipeline` and configures it for the chat scenario. + +## Download and convert the model and tokenizers + +To convert the model you have to use the Python package `optimum-intel`. +The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version. + +Install [../../export-requirements.txt](../../export-requirements.txt) to convert a model. + +```sh +pip install --upgrade-strategy eager -r ../../export-requirements.txt +optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0 +``` + +## Run: + +Compile the GenAI JavaScript bindings archive first using the instructions in [../../../src/js/README.md](../../../src/js/README.md#build-bindings). + +Run `npm install` in the current folder and then run the sample: + +`node chat_sample.js TinyLlama-1.1B-Chat-v1.0` + +Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model meta-llama/Llama-2-13b-chat-hf can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU, as sketched below. + +See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.
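The device switch mentioned above is a one-line edit in the sample. A minimal sketch, assuming the `Pipeline.LLMPipeline` API and the `device` constant from `chat_sample.js` shown later in this patch; whether GPU inference works depends on the installed OpenVINO runtime and drivers:

```js
// In chat_sample.js, switch the inference device from CPU to GPU.
const device = 'GPU'; // was: const device = 'CPU'; // GPU can be used as well

// The rest of the sample stays unchanged:
const pipe = await Pipeline.LLMPipeline(MODEL_PATH, device);
```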
+ +### Troubleshooting + +#### Unicode characters encoding error on Windows + +Example error: +``` +UnicodeEncodeError: 'charmap' codec can't encode character '\u25aa' in position 0: character maps to +``` + +If you encounter the error described in the example when sample is printing output to the Windows console, it is likely due to the default Windows encoding not supporting certain Unicode characters. To resolve this: +1. Enable Unicode characters for Windows cmd - open `Region` settings from `Control panel`. `Administrative`->`Change system locale`->`Beta: Use Unicode UTF-8 for worldwide language support`->`OK`. Reboot. +2. Enable UTF-8 mode by setting environment variable `PYTHONIOENCODING="utf8"`. + +#### Missing chat template + +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model. +The following template can be used as a default, but it may not work properly with every model: +``` +"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", +``` diff --git a/samples/js/text_generation/chat_sample.js b/samples/js/text_generation/chat_sample.js new file mode 100644 index 0000000000..cf4c5e7704 --- /dev/null +++ b/samples/js/text_generation/chat_sample.js @@ -0,0 +1,54 @@ +import readline from 'readline'; +import { Pipeline } from 'genai-node'; + +main(); + +function streamer(subword) { + process.stdout.write(subword); +} + +async function main() { + const MODEL_PATH = process.argv[2]; + + if (!MODEL_PATH) { + console.error('Please specify path to model directory\n' + + 'Run command must be: `node chat_sample.js *path_to_model_dir*`'); + process.exit(1); + } + + const device = 'CPU'; // GPU can be used as well + + // Create interface for reading user input from stdin + const rl = readline.createInterface({ + input: process.stdin, + output: process.stdout, + }); + + const pipe = await Pipeline.LLMPipeline(MODEL_PATH, device); + const config = { 'max_new_tokens': 100 }; + + await pipe.startChat(); + promptUser(); + + // Function to prompt the user for input + function promptUser() { + rl.question('question:\n', handleInput); + } + + // Function to handle user input + async function handleInput(input) { + input = input.trim(); + + // Check for exit command + if (!input) { + await pipe.finishChat(); + rl.close(); + process.exit(0); + } + + await pipe.generate(input, config, streamer); + console.log('\n----------'); + + if (!rl.closed) promptUser(); + } +} diff --git a/samples/js/text_generation/package-lock.json b/samples/js/text_generation/package-lock.json new file mode 100644 index 0000000000..fbee0db012 --- /dev/null +++ b/samples/js/text_generation/package-lock.json @@ -0,0 +1,42 @@ +{ + "name": "genai-node-demo", + "version": "1.0.0", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "genai-node-demo", + "version": "1.0.0", + "license": "Apache-2.0", + "devDependencies": { + "genai-node": "../../../src/js/" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "../../../src/js": { + "name": "genai-node", + "version": "2024.5.0-preview", + "dev": true, + "license": "Apache-2.0", + "os": [ + "linux", + "darwin", + "win32" + 
], + "devDependencies": { + "@huggingface/hub": "^0.21.0", + "global-agent": "^3.0.0", + "node-fetch": "^3.3.2" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "node_modules/genai-node": { + "resolved": "../../../src/js", + "link": true + } + } +} diff --git a/samples/js/text_generation/package.json b/samples/js/text_generation/package.json new file mode 100644 index 0000000000..24e66a120d --- /dev/null +++ b/samples/js/text_generation/package.json @@ -0,0 +1,15 @@ +{ + "name": "genai-node-demo", + "version": "1.0.0", + "license": "Apache-2.0", + "type": "module", + "devDependencies": { + "genai-node": "../../../src/js/" + }, + "engines": { + "node": ">=21.0.0" + }, + "scripts": { + "test": "node tests/usage.test.js" + } +} diff --git a/samples/js/text_generation/tests/usage.test.js b/samples/js/text_generation/tests/usage.test.js new file mode 100644 index 0000000000..fcd58a0b69 --- /dev/null +++ b/samples/js/text_generation/tests/usage.test.js @@ -0,0 +1,62 @@ +import { env } from 'process'; +import { spawn } from 'child_process'; + +const MODEL_PATH = env.MODEL_PATH; +const prompt = 'Tell me exactly, no changes, print as is: "Hello world"'; + +if (!MODEL_PATH) + throw new Error( + 'Please environment variable MODEL_PATH to the path of the model directory' + ); + +const runTest = async () => { + return new Promise((resolve, reject) => { + const script = spawn('node', ['chat_sample.js', MODEL_PATH]); + let output = ''; + + // Collect output from stdout + script.stdout.on('data', (data) => { + output += data.toString(); + }); + + // Capture errors + script.stderr.on('data', (data) => { + reject(data.toString()); + }); + + // Send input after detecting the question prompt + script.stdout.once('data', (data) => { + if (data.toString().startsWith('question:')) { + script.stdin.write(`${prompt}\n`); // Provide input + script.stdin.end(); // Close stdin to signal EOF + } + }); + + // Check results when the process exits + script.on('close', (code) => { + if (code !== 0) { + return reject(`Process exited with code ${code}`); + } + + // Log the output + console.log(`Result output: ${output}`); + + // Validate the output + if (typeof output == 'string' && output.length > 0) { + resolve('Test passed!'); + } else { + reject('Test failed: Output did not match expected result.'); + } + }); + }); +}; + +runTest() + .then((message) => { + console.log(message); + process.exit(0); + }) + .catch((err) => { + console.error(err); + process.exit(1); + }); diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index 2f615a1b6f..c2ef969838 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -7,3 +7,7 @@ add_subdirectory(cpp) if(ENABLE_PYTHON) add_subdirectory(python) endif() + +if(ENABLE_JS) + add_subdirectory(js) +endif() diff --git a/src/cpp/CMakeLists.txt b/src/cpp/CMakeLists.txt index 43bca747ec..5c50f55268 100644 --- a/src/cpp/CMakeLists.txt +++ b/src/cpp/CMakeLists.txt @@ -147,13 +147,42 @@ if(MSVC OR APPLE) set(ARCH_DIR ${ARCH_DIR}/${CMAKE_BUILD_TYPE}) endif() +# Put binaries at the top level for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + set(LIBRARY_DESTINATION .) + set(ARCHIVE_DESTINATION .) + set(RUNTIME_DESTINATION .) 
+ + # setting RPATH / LC_RPATH depending on platform + if(LINUX) + # to find libopenvino.so in the same folder + set(rpaths "$ORIGIN") + elseif(APPLE) + # to find libopenvino.dylib in the same folder + set(rpaths "@loader_path") + endif() + + if(rpaths) + set_target_properties(${TARGET_NAME} PROPERTIES INSTALL_RPATH "${rpaths}") + endif() +else() + set(LIBRARY_DESTINATION runtime/lib/${ARCH_DIR}) + set(ARCHIVE_DESTINATION runtime/lib/${ARCH_DIR}) + set(RUNTIME_DESTINATION runtime/bin/${ARCH_DIR}) +endif() + install(TARGETS ${TARGET_NAME} EXPORT OpenVINOGenAITargets - LIBRARY DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai + LIBRARY DESTINATION ${LIBRARY_DESTINATION} COMPONENT core_genai NAMELINK_COMPONENT core_genai_dev - ARCHIVE DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai_dev - RUNTIME DESTINATION runtime/bin/${ARCH_DIR} COMPONENT core_genai + ARCHIVE DESTINATION ${ARCHIVE_DESTINATION} COMPONENT core_genai_dev + RUNTIME DESTINATION ${RUNTIME_DESTINATION} COMPONENT core_genai INCLUDES DESTINATION runtime/include) +# development files do not need to be built for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + return() +endif() + install(DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/include/ DESTINATION runtime/include COMPONENT core_genai_dev) install(FILES ${CMAKE_CURRENT_BINARY_DIR}/openvino/genai/version.hpp diff --git a/src/js/.gitignore b/src/js/.gitignore new file mode 100644 index 0000000000..8990d8c418 --- /dev/null +++ b/src/js/.gitignore @@ -0,0 +1,6 @@ +.vscode +bin +bin.* +build +node_modules +tests/models diff --git a/src/js/.npmignore b/src/js/.npmignore new file mode 100644 index 0000000000..9bf3e571b1 --- /dev/null +++ b/src/js/.npmignore @@ -0,0 +1,15 @@ +.vscode +bin.* +build +include +src +tests + +.eslintrc.js +CMakeLists.txt +tsconfig.json +TODO.md +build.sh + +**/*.tsbuildinfo +*.tgz diff --git a/src/js/CMakeLists.txt b/src/js/CMakeLists.txt new file mode 100644 index 0000000000..7e4ff0bea4 --- /dev/null +++ b/src/js/CMakeLists.txt @@ -0,0 +1,93 @@ +cmake_minimum_required(VERSION 3.18) + +# Set C++ standard +set(CMAKE_CXX_STANDARD 17) + +set(dist_folder "${CMAKE_SOURCE_DIR}/bin/") + +if(WIN32) + set(CMAKE_SHARED_LINKER_FLAGS /DELAYLOAD:NODE.EXE) + set(CMAKE_JS_LIB ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/node.lib) + set(CMAKE_JS_SRC ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/win_delay_load_hook.cc) + + set(CMAKE_JS_NODELIB_DEF ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/node-lib.def) + set(CMAKE_JS_NODELIB_TARGET ${CMAKE_JS_LIB}) + set(DELAYIMP_LIB delayimp.lib) +endif() + +project(genai_node_addon) + +# Specify NAPI version 8 +# supports v12.22.0+, v14.17.0+, v15.12.0+, 16.0.0 and all later Node.js versions +add_definitions(-DNAPI_VERSION=8) + +include(FetchContent) + +FetchContent_Declare( + node-api-headers + URL https://github.com/nodejs/node-api-headers/archive/refs/tags/v1.1.0.tar.gz + URL_HASH SHA256=70608bc1e6dddce280285f3462f18a106f687c0720a4b90893e1ecd86e5a8bbf +) +FetchContent_MakeAvailable(node-api-headers) + +FetchContent_Declare( + node-addon-api + URL https://github.com/nodejs/node-addon-api/archive/refs/tags/v8.0.0.tar.gz + URL_HASH SHA256=42424c5206b9d67b41af4fcff5d6e3cb22074168035a03b8467852938a281d47 +) +FetchContent_MakeAvailable(node-addon-api) + +# Create a library +add_library(${PROJECT_NAME} SHARED + ${CMAKE_CURRENT_SOURCE_DIR}/src/addon.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/llm_pipeline_wrapper.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/finish_chat_worker.cpp + 
${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/start_chat_worker.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/init_worker.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/helper.cpp + + ${CMAKE_JS_SRC} +) + +# Include directories +target_include_directories(${PROJECT_NAME} PRIVATE + "${node-api-headers_SOURCE_DIR}/include" + "${node-addon-api_SOURCE_DIR}" + "${CMAKE_CURRENT_SOURCE_DIR}" +) + +target_link_libraries(${PROJECT_NAME} PRIVATE openvino::genai ${DELAYIMP_LIB} ${CMAKE_JS_LIB}) + +if(MSVC AND CMAKE_JS_NODELIB_DEF AND CMAKE_JS_NODELIB_TARGET) # Generate node.lib + execute_process(COMMAND ${CMAKE_AR} /def:${CMAKE_JS_NODELIB_DEF} /out:${CMAKE_JS_NODELIB_TARGET} ${CMAKE_STATIC_LINKER_FLAGS}) +endif() + +if(APPLE) + target_link_options(${PROJECT_NAME} PRIVATE -Wl,-undefined,suppress,-flat_namespace) +elseif(AARCH64 OR ARM) + target_link_options(${PROJECT_NAME} PRIVATE -Wl,--unresolved-symbols=ignore-all) +endif() + +# Set library properties +set_target_properties(${PROJECT_NAME} PROPERTIES + PREFIX "" + SUFFIX ".node" +) + +# setting RPATH / LC_RPATH depending on platform +if(LINUX) + # to find libopenvino_genai.so in the same folder + set(rpaths "$ORIGIN") +elseif(APPLE) + # to find libopenvino_genai.dylib in the same folder + set(rpaths "@loader_path") +endif() + +if(rpaths) + set_target_properties(${PROJECT_NAME} PROPERTIES INSTALL_RPATH "${rpaths}") +endif() + +install(TARGETS ${PROJECT_NAME} + LIBRARY DESTINATION . COMPONENT ${PROJECT_NAME} + RUNTIME DESTINATION . COMPONENT ${PROJECT_NAME} +) diff --git a/src/js/README.md b/src/js/README.md new file mode 100644 index 0000000000..f5ccf1117c --- /dev/null +++ b/src/js/README.md @@ -0,0 +1,56 @@ +# OpenVINO™ GenAI Node.js bindings (preview) + +## DISCLAIMER + +This is a preview version; do not use it in production! + +## Install and Run + +### Requirements + +- Node.js v21+ +- Tested on Ubuntu; other operating systems have not been tested yet + +### Build Bindings + +#### Build OpenVINO GenAI as OpenVINO Extra Module + +OpenVINO GenAI Node.js bindings can be built as an extra module during the OpenVINO build process. This method simplifies the build process by integrating OpenVINO GenAI directly into the OpenVINO build. + +1. Clone the OpenVINO repository: + ```sh + git clone --recursive https://github.com/openvinotoolkit/openvino.git + ``` +1. Configure CMake with OpenVINO extra modules: + ```sh + cmake -DOPENVINO_EXTRA_MODULES=*path to genai repository directory* -DCPACK_ARCHIVE_COMPONENT_INSTALL=OFF \ + -DCPACK_GENERATOR=NPM \ + -DENABLE_PYTHON=OFF \ + -DENABLE_WHEEL=OFF \ + -DCPACK_PACKAGE_FILE_NAME=genai_nodejs_bindings \ + -S ./openvino -B ./build + ``` +1. Build the OpenVINO archive with GenAI: + ```sh + cmake --build ./build --target package -j + ``` + +1. Put the Node.js bindings into the npm package `bin` directory and install dependencies: + ```sh + mkdir ./src/js/bin/ + tar -xvf ./build/genai_nodejs_bindings.tar.gz --directory ./src/js/bin/ + cd ./src/js/ + npm install + ``` +1. Run tests to be sure that everything works: + ```sh + npm test + ``` + +### Using as npm Dependency + +To use this package locally, run `npm link` in the `src/js/` directory +and `npm link genai-node` in the folder where you want to add this package as a dependency. + +To pack this package for distribution as a regular npm package, run `npm pack`. +This command creates an archive that you may use in your projects.
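Once linked (or packed and installed), the package exposes the chat API implemented in `lib/module.js`. A minimal usage sketch, assuming a converted model directory `./TinyLlama-1.1B-Chat-v1.0` and the `Pipeline.LLMPipeline`, `startChat`, `generate`, and `finishChat` methods added by this patch:

```js
import { Pipeline } from 'genai-node';

// Model path and device are assumptions; any converted OpenVINO IR LLM works.
const pipe = await Pipeline.LLMPipeline('./TinyLlama-1.1B-Chat-v1.0', 'CPU');

await pipe.startChat();
// The third argument is an optional streamer callback receiving generated subwords.
const answer = await pipe.generate(
  'What is OpenVINO?',
  { max_new_tokens: 100 },
  (subword) => process.stdout.write(subword)
);
await pipe.finishChat();
console.log(`\nFull answer: ${answer}`);
```

The callback mirrors the `streamer` function used by `samples/js/text_generation/chat_sample.js` in this patch.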
diff --git a/src/js/include/addon.hpp b/src/js/include/addon.hpp new file mode 100644 index 0000000000..35e5cc462e --- /dev/null +++ b/src/js/include/addon.hpp @@ -0,0 +1,20 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#pragma once + +#include + +typedef Napi::Function (*Prototype)(Napi::Env); + +struct AddonData { + Napi::FunctionReference core; +}; + +void init_class(Napi::Env env, + Napi::Object exports, + std::string class_name, + Prototype func, + Napi::FunctionReference& reference); + +Napi::Object init_module(Napi::Env env, Napi::Object exports); diff --git a/src/js/include/helper.hpp b/src/js/include/helper.hpp new file mode 100644 index 0000000000..4a010df019 --- /dev/null +++ b/src/js/include/helper.hpp @@ -0,0 +1,23 @@ +#pragma once +#include + +#include "openvino/core/type/element_type.hpp" +#include "openvino/openvino.hpp" + +ov::AnyMap to_anyMap(const Napi::Env&, const Napi::Value&); + +/** + * @brief Template function to convert Javascript data types into C++ data types + * @tparam TargetType destinated C++ data type + * @param info Napi::CallbackInfo contains all arguments passed to a function or method + * @param idx specifies index of a argument inside info. + * @return specified argument converted to a TargetType. + */ +template +TargetType js_to_cpp(const Napi::Env& env, const Napi::Value& value); + +/** @brief A template specialization for TargetType ov::Any */ +template <> +ov::Any js_to_cpp(const Napi::Env& env, const Napi::Value& value); + +bool is_napi_value_int(const Napi::Env& env, const Napi::Value& num); diff --git a/src/js/include/llm_pipeline/finish_chat_worker.hpp b/src/js/include/llm_pipeline/finish_chat_worker.hpp new file mode 100644 index 0000000000..ca80b30aff --- /dev/null +++ b/src/js/include/llm_pipeline/finish_chat_worker.hpp @@ -0,0 +1,18 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class FinishChatWorker : public AsyncWorker { + public: + FinishChatWorker(Function& callback, std::shared_ptr& pipe); + virtual ~FinishChatWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; +}; diff --git a/src/js/include/llm_pipeline/init_worker.hpp b/src/js/include/llm_pipeline/init_worker.hpp new file mode 100644 index 0000000000..5fc05969fb --- /dev/null +++ b/src/js/include/llm_pipeline/init_worker.hpp @@ -0,0 +1,21 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class InitWorker : public AsyncWorker { + public: + InitWorker(Function& callback, std::shared_ptr& pipe, + const std::string model_path, std::string device); + virtual ~InitWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; + std::string model_path; + std::string device; +}; diff --git a/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp b/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp new file mode 100644 index 0000000000..872e9ea023 --- /dev/null +++ b/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp @@ -0,0 +1,27 @@ +#pragma once + +#include +#include +#include "openvino/genai/llm_pipeline.hpp" + +class LLMPipelineWrapper : public Napi::ObjectWrap { +public: + LLMPipelineWrapper(const Napi::CallbackInfo& info); + + static Napi::Function get_class(Napi::Env env); + + Napi::Value init(const Napi::CallbackInfo& info); + Napi::Value generate(const Napi::CallbackInfo& info); + Napi::Value start_chat(const Napi::CallbackInfo& info); + Napi::Value finish_chat(const 
Napi::CallbackInfo& info); +private: + bool is_loaded = false; + bool is_initialized = false; + bool is_running = false; + + std::string model_path; + std::string device; + + std::shared_ptr pipe = nullptr; + std::function streamer; +}; diff --git a/src/js/include/llm_pipeline/start_chat_worker.hpp b/src/js/include/llm_pipeline/start_chat_worker.hpp new file mode 100644 index 0000000000..fde0cfaa0a --- /dev/null +++ b/src/js/include/llm_pipeline/start_chat_worker.hpp @@ -0,0 +1,18 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class StartChatWorker : public AsyncWorker { + public: + StartChatWorker(Function& callback, std::shared_ptr& pipe); + virtual ~StartChatWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; +}; diff --git a/src/js/lib/bindings.cjs b/src/js/lib/bindings.cjs new file mode 100644 index 0000000000..acd9e590b8 --- /dev/null +++ b/src/js/lib/bindings.cjs @@ -0,0 +1 @@ +module.exports = require('../bin/genai_node_addon.node'); diff --git a/src/js/lib/module.js b/src/js/lib/module.js new file mode 100644 index 0000000000..6595ba0de0 --- /dev/null +++ b/src/js/lib/module.js @@ -0,0 +1,141 @@ +import util from 'node:util'; + +import addon from './bindings.cjs'; + +class LLMPipeline { + modelPath = null; + device = null; + pipeline = null; + isInitialized = false; + isChatStarted = false; + + constructor(modelPath, device) { + this.modelPath = modelPath; + this.device = device; + } + + async init() { + if (this.isInitialized) + throw new Error('Pipeline is already initialized'); + + this.pipeline = new addon.LLMPipeline(); + + const init = util.promisify(this.pipeline.init.bind(this.pipeline)); + const result = await init(this.modelPath, this.device); + + this.isInitialized = true; + + return result; + } + + async startChat() { + if (this.isChatStarted) + throw new Error('Chat is already started'); + + const startChatPromise = util.promisify( + this.pipeline.startChat.bind(this.pipeline) + ); + const result = await startChatPromise(); + + this.isChatStarted = true; + + return result; + } + async finishChat() { + if (!this.isChatStarted) + throw new Error('Chat is not started'); + + const finishChatPromise = util.promisify( + this.pipeline.finishChat.bind(this.pipeline) + ); + const result = await finishChatPromise(); + + this.isChatStarted = false; + + return result; + } + + static castOptionsToString(options) { + const castedOptions = {}; + + for (const key in options) + castedOptions[key] = String(options[key]); + + return castedOptions; + } + + getAsyncGenerator(prompt, generationOptions = {}) { + if (!this.isInitialized) + throw new Error('Pipeline is not initialized'); + + if (typeof prompt !== 'string') + throw new Error('Prompt must be a string'); + if (typeof generationOptions !== 'object') + throw new Error('Options must be an object'); + + const castedOptions = LLMPipeline.castOptionsToString(generationOptions); + + const queue = []; + let resolvePromise; + + // Callback function that C++ will call when a chunk is ready + function chunkOutput(isDone, subword) { + if (resolvePromise) { + resolvePromise({ value: subword, done: isDone }); // Fulfill pending request + resolvePromise = null; // Reset promise resolver + } else { + queue.push({ isDone, subword }); // Add data to queue if no pending promise + } + } + + this.pipeline.generate(prompt, chunkOutput, castedOptions); + + return { + async next() { + // If there is data in the queue, return it + // Otherwise, return a promise that 
will resolve when data is available + if (queue.length > 0) { + const { isDone, subword } = queue.shift(); + + return { value: subword, done: isDone }; + } + + return new Promise((resolve) => (resolvePromise = resolve)); + }, + [Symbol.asyncIterator]() { return this; } + }; + } + + async generate(prompt, generationOptions, generationCallback) { + const options = generationOptions || {}; + + if (generationCallback !== undefined && typeof generationCallback !== 'function') + throw new Error('Generation callback must be a function'); + + const g = this.getAsyncGenerator(prompt, options); + const result = []; + + for await (const chunk of g) { + result.push(chunk); + + if (generationCallback) generationCallback(chunk); + } + + return result.join(''); + } +} + +class Pipeline { + static async LLMPipeline(modelPath, device = 'CPU') { + const pipeline = new LLMPipeline(modelPath, device); + await pipeline.init(); + + return pipeline; + } +} + + +export { + addon, + Pipeline, +}; diff --git a/src/js/package-lock.json b/src/js/package-lock.json new file mode 100644 index 0000000000..4da5b57ea7 --- /dev/null +++ b/src/js/package-lock.json @@ -0,0 +1,470 @@ +{ + "name": "genai-node", + "version": "2024.5.0-preview", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "genai-node", + "version": "2024.5.0-preview", + "license": "Apache-2.0", + "os": [ + "linux", + "darwin", + "win32" + ], + "devDependencies": { + "@huggingface/hub": "^0.21.0", + "global-agent": "^3.0.0", + "node-fetch": "^3.3.2" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "node_modules/@huggingface/hub": { + "version": "0.21.0", + "resolved": "https://registry.npmjs.org/@huggingface/hub/-/hub-0.21.0.tgz", + "integrity": "sha512-DpitNhqobMJLTv8dUq/EMtrz1dpfs3UrSVCxe1aKpjLAdOs6Gm6rqrinUFNvC9G88RIRzIYzojUtYUqlkKwKnA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@huggingface/tasks": "^0.13.3" + }, + "engines": { + "node": ">=18" + } + }, + "node_modules/@huggingface/tasks": { + "version": "0.13.4", + "resolved": "https://registry.npmjs.org/@huggingface/tasks/-/tasks-0.13.4.tgz", + "integrity": "sha512-LETHbMSK3gHBFU0D09ziEJm6t1Pcgii4SFwHw+d+8MFGfkAryxaDl2qaHY+PxiTkZEeaTLd6G8/239SJuVxyWg==", + "dev": true, + "license": "MIT" + }, + "node_modules/boolean": { + "version": "3.2.0", + "resolved": "https://registry.npmjs.org/boolean/-/boolean-3.2.0.tgz", + "integrity": "sha512-d0II/GO9uf9lfUHH2BQsjxzRJZBdsjgsBiW4BvhWk/3qoKwQFjIDVN19PfX8F2D/r9PCMTtLWjYVCFrpeYUzsw==", + "deprecated": "Package no longer supported. 
Contact Support at https://www.npmjs.com/support for more info.", + "dev": true, + "license": "MIT" + }, + "node_modules/data-uri-to-buffer": { + "version": "4.0.1", + "resolved": "https://registry.npmjs.org/data-uri-to-buffer/-/data-uri-to-buffer-4.0.1.tgz", + "integrity": "sha512-0R9ikRb668HB7QDxT1vkpuUBtqc53YyAwMwGeUFKRojY/NWKvdZ+9UYtRfGmhqNbRkTSVpMbmyhXipFFv2cb/A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 12" + } + }, + "node_modules/define-data-property": { + "version": "1.1.4", + "resolved": "https://registry.npmjs.org/define-data-property/-/define-data-property-1.1.4.tgz", + "integrity": "sha512-rBMvIzlpA8v6E+SJZoo++HAYqsLrkg7MSfIinMPFhmkorw7X+dOXVJQs+QT69zGkzMyfDnIMN2Wid1+NbL3T+A==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-define-property": "^1.0.0", + "es-errors": "^1.3.0", + "gopd": "^1.0.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/define-properties": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/define-properties/-/define-properties-1.2.1.tgz", + "integrity": "sha512-8QmQKqEASLd5nx0U1B1okLElbUuuttJ/AnYmRXbbbGDWh6uS208EjD4Xqq/I9wK7u0v6O08XhTWnt5XtEbR6Dg==", + "dev": true, + "license": "MIT", + "dependencies": { + "define-data-property": "^1.0.1", + "has-property-descriptors": "^1.0.0", + "object-keys": "^1.1.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/detect-node": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/detect-node/-/detect-node-2.1.0.tgz", + "integrity": "sha512-T0NIuQpnTvFDATNuHN5roPwSBG83rFsuO+MXXH9/3N1eFbn4wcPjttvjMLEPWJ0RGUYgQE7cGgS3tNxbqCGM7g==", + "dev": true, + "license": "MIT" + }, + "node_modules/es-define-property": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/es-define-property/-/es-define-property-1.0.0.tgz", + "integrity": "sha512-jxayLKShrEqqzJ0eumQbVhTYQM27CfT1T35+gCgDFoL82JLsXqTJ76zv6A0YLOgEnLUMvLzsDsGIrl8NFpT2gQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "get-intrinsic": "^1.2.4" + }, + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/es-errors": { + "version": "1.3.0", + "resolved": "https://registry.npmjs.org/es-errors/-/es-errors-1.3.0.tgz", + "integrity": "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/es6-error": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/es6-error/-/es6-error-4.1.1.tgz", + "integrity": "sha512-Um/+FxMr9CISWh0bi5Zv0iOD+4cFh5qLeks1qhAopKVAJw3drgKbKySikp7wGhDL0HPeaja0P5ULZrxLkniUVg==", + "dev": true, + "license": "MIT" + }, + "node_modules/escape-string-regexp": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/escape-string-regexp/-/escape-string-regexp-4.0.0.tgz", + "integrity": "sha512-TtpcNJ3XAzx3Gq8sWRzJaVajRs0uVxA2YAkdb1jm2YkPz4G6egUFAyA3n5vtEIZefPk5Wa4UXbKuS5fKkJWdgA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/fetch-blob": { + "version": "3.2.0", + "resolved": "https://registry.npmjs.org/fetch-blob/-/fetch-blob-3.2.0.tgz", + "integrity": "sha512-7yAQpD2UMJzLi1Dqv7qFYnPbaPx7ZfFK6PiIxQ4PfkGPyNyl2Ugx+a/umUonmKqjhM4DnfbMvdX6otXq83soQQ==", + "dev": true, + "funding": [ + { + "type": "github", + "url": 
"https://github.com/sponsors/jimmywarting" + }, + { + "type": "paypal", + "url": "https://paypal.me/jimmywarting" + } + ], + "license": "MIT", + "dependencies": { + "node-domexception": "^1.0.0", + "web-streams-polyfill": "^3.0.3" + }, + "engines": { + "node": "^12.20 || >= 14.13" + } + }, + "node_modules/formdata-polyfill": { + "version": "4.0.10", + "resolved": "https://registry.npmjs.org/formdata-polyfill/-/formdata-polyfill-4.0.10.tgz", + "integrity": "sha512-buewHzMvYL29jdeQTVILecSaZKnt/RJWjoZCF5OW60Z67/GmSLBkOFM7qh1PI3zFNtJbaZL5eQu1vLfazOwj4g==", + "dev": true, + "license": "MIT", + "dependencies": { + "fetch-blob": "^3.1.2" + }, + "engines": { + "node": ">=12.20.0" + } + }, + "node_modules/function-bind": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/function-bind/-/function-bind-1.1.2.tgz", + "integrity": "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA==", + "dev": true, + "license": "MIT", + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/get-intrinsic": { + "version": "1.2.4", + "resolved": "https://registry.npmjs.org/get-intrinsic/-/get-intrinsic-1.2.4.tgz", + "integrity": "sha512-5uYhsJH8VJBTv7oslg4BznJYhDoRI6waYCxMmCdnTrcCrHA/fCFKoTFz2JKKE0HdDFUF7/oQuhzumXJK7paBRQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-errors": "^1.3.0", + "function-bind": "^1.1.2", + "has-proto": "^1.0.1", + "has-symbols": "^1.0.3", + "hasown": "^2.0.0" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/global-agent": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/global-agent/-/global-agent-3.0.0.tgz", + "integrity": "sha512-PT6XReJ+D07JvGoxQMkT6qji/jVNfX/h364XHZOWeRzy64sSFr+xJ5OX7LI3b4MPQzdL4H8Y8M0xzPpsVMwA8Q==", + "dev": true, + "license": "BSD-3-Clause", + "dependencies": { + "boolean": "^3.0.1", + "es6-error": "^4.1.1", + "matcher": "^3.0.0", + "roarr": "^2.15.3", + "semver": "^7.3.2", + "serialize-error": "^7.0.1" + }, + "engines": { + "node": ">=10.0" + } + }, + "node_modules/globalthis": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/globalthis/-/globalthis-1.0.4.tgz", + "integrity": "sha512-DpLKbNU4WylpxJykQujfCcwYWiV/Jhm50Goo0wrVILAv5jOr9d+H+UR3PhSCD2rCCEIg0uc+G+muBTwD54JhDQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "define-properties": "^1.2.1", + "gopd": "^1.0.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/gopd": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/gopd/-/gopd-1.0.1.tgz", + "integrity": "sha512-d65bNlIadxvpb/A2abVdlqKqV563juRnZ1Wtk6s1sIR8uNsXR70xqIzVqxVf1eTqDunwT2MkczEeaezCKTZhwA==", + "dev": true, + "license": "MIT", + "dependencies": { + "get-intrinsic": "^1.1.3" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-property-descriptors": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/has-property-descriptors/-/has-property-descriptors-1.0.2.tgz", + "integrity": "sha512-55JNKuIW+vq4Ke1BjOTjM2YctQIvCT7GFzHwmfZPGo5wnrgkid0YQtnAleFSqumZm4az3n2BS+erby5ipJdgrg==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-define-property": "^1.0.0" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-proto": { + "version": "1.0.3", + "resolved": 
"https://registry.npmjs.org/has-proto/-/has-proto-1.0.3.tgz", + "integrity": "sha512-SJ1amZAJUiZS+PhsVLf5tGydlaVB8EdFpaSO4gmiUKUOxk8qzn5AIy4ZeJUmh22znIdk/uMAUT2pl3FxzVUH+Q==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-symbols": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/has-symbols/-/has-symbols-1.0.3.tgz", + "integrity": "sha512-l3LCuF6MgDNwTDKkdYGEihYjt5pRPbEg46rtlmnSPlUbgmB8LOIrKJbYYFBSbnPaJexMKtiPO8hmeRjRz2Td+A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/hasown": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/hasown/-/hasown-2.0.2.tgz", + "integrity": "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "function-bind": "^1.1.2" + }, + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/json-stringify-safe": { + "version": "5.0.1", + "resolved": "https://registry.npmjs.org/json-stringify-safe/-/json-stringify-safe-5.0.1.tgz", + "integrity": "sha512-ZClg6AaYvamvYEE82d3Iyd3vSSIjQ+odgjaTzRuO3s7toCdFKczob2i0zCh7JE8kWn17yvAWhUVxvqGwUalsRA==", + "dev": true, + "license": "ISC" + }, + "node_modules/matcher": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/matcher/-/matcher-3.0.0.tgz", + "integrity": "sha512-OkeDaAZ/bQCxeFAozM55PKcKU0yJMPGifLwV4Qgjitu+5MoAfSQN4lsLJeXZ1b8w0x+/Emda6MZgXS1jvsapng==", + "dev": true, + "license": "MIT", + "dependencies": { + "escape-string-regexp": "^4.0.0" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/node-domexception": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/node-domexception/-/node-domexception-1.0.0.tgz", + "integrity": "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==", + "dev": true, + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/jimmywarting" + }, + { + "type": "github", + "url": "https://paypal.me/jimmywarting" + } + ], + "license": "MIT", + "engines": { + "node": ">=10.5.0" + } + }, + "node_modules/node-fetch": { + "version": "3.3.2", + "resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-3.3.2.tgz", + "integrity": "sha512-dRB78srN/l6gqWulah9SrxeYnxeddIG30+GOqK/9OlLVyLg3HPnr6SqOWTWOXKRwC2eGYCkZ59NNuSgvSrpgOA==", + "dev": true, + "license": "MIT", + "dependencies": { + "data-uri-to-buffer": "^4.0.0", + "fetch-blob": "^3.1.4", + "formdata-polyfill": "^4.0.10" + }, + "engines": { + "node": "^12.20.0 || ^14.13.1 || >=16.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/node-fetch" + } + }, + "node_modules/object-keys": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/object-keys/-/object-keys-1.1.1.tgz", + "integrity": "sha512-NuAESUOUMrlIXOfHKzD6bpPu3tYt3xvjNdRIQ+FeT0lNb4K8WR70CaDxhuNguS2XG+GjkyMwOzsN5ZktImfhLA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/roarr": { + "version": "2.15.4", + "resolved": "https://registry.npmjs.org/roarr/-/roarr-2.15.4.tgz", + "integrity": "sha512-CHhPh+UNHD2GTXNYhPWLnU8ONHdI+5DI+4EYIAOaiD63rHeYlZvyh8P+in5999TTSFgUYuKUAjzRI4mdh/p+2A==", + "dev": true, + "license": "BSD-3-Clause", + "dependencies": { + "boolean": "^3.0.1", + "detect-node": "^2.0.4", + "globalthis": 
"^1.0.1", + "json-stringify-safe": "^5.0.1", + "semver-compare": "^1.0.0", + "sprintf-js": "^1.1.2" + }, + "engines": { + "node": ">=8.0" + } + }, + "node_modules/semver": { + "version": "7.6.3", + "resolved": "https://registry.npmjs.org/semver/-/semver-7.6.3.tgz", + "integrity": "sha512-oVekP1cKtI+CTDvHWYFUcMtsK/00wmAEfyqKfNdARm8u1wNVhSgaX7A8d4UuIlUI5e84iEwOhs7ZPYRmzU9U6A==", + "dev": true, + "license": "ISC", + "bin": { + "semver": "bin/semver.js" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/semver-compare": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/semver-compare/-/semver-compare-1.0.0.tgz", + "integrity": "sha512-YM3/ITh2MJ5MtzaM429anh+x2jiLVjqILF4m4oyQB18W7Ggea7BfqdH/wGMK7dDiMghv/6WG7znWMwUDzJiXow==", + "dev": true, + "license": "MIT" + }, + "node_modules/serialize-error": { + "version": "7.0.1", + "resolved": "https://registry.npmjs.org/serialize-error/-/serialize-error-7.0.1.tgz", + "integrity": "sha512-8I8TjW5KMOKsZQTvoxjuSIa7foAwPWGOts+6o7sgjz41/qMD9VQHEDxi6PBvK2l0MXUmqZyNpUK+T2tQaaElvw==", + "dev": true, + "license": "MIT", + "dependencies": { + "type-fest": "^0.13.1" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/sprintf-js": { + "version": "1.1.3", + "resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.1.3.tgz", + "integrity": "sha512-Oo+0REFV59/rz3gfJNKQiBlwfHaSESl1pcGyABQsnnIfWOFt6JNj5gCog2U6MLZ//IGYD+nA8nI+mTShREReaA==", + "dev": true, + "license": "BSD-3-Clause" + }, + "node_modules/type-fest": { + "version": "0.13.1", + "resolved": "https://registry.npmjs.org/type-fest/-/type-fest-0.13.1.tgz", + "integrity": "sha512-34R7HTnG0XIJcBSn5XhDd7nNFPRcXYRZrBB2O2jdKqYODldSzBAqzsWoZYYvduky73toYS/ESqxPvkDf/F0XMg==", + "dev": true, + "license": "(MIT OR CC0-1.0)", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/web-streams-polyfill": { + "version": "3.3.3", + "resolved": "https://registry.npmjs.org/web-streams-polyfill/-/web-streams-polyfill-3.3.3.tgz", + "integrity": "sha512-d2JWLCivmZYTSIoge9MsgFCZrt571BikcWGYkjC1khllbTeDlGqZ2D8vD8E/lJa8WGWbb7Plm8/XJYV7IJHZZw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 8" + } + } + } +} diff --git a/src/js/package.json b/src/js/package.json new file mode 100644 index 0000000000..5b069fb01f --- /dev/null +++ b/src/js/package.json @@ -0,0 +1,30 @@ +{ + "name": "genai-node", + "type": "module", + "version": "2024.5.0-preview", + "description": "OpenVINO™ GenAI pipelines for using from Node.js environment", + "license": "Apache-2.0", + "main": "./lib/module.js", + "os": [ + "linux", + "darwin", + "win32" + ], + "engines": { + "node": ">=21.0.0" + }, + "keywords": [ + "OpenVINO", + "OpenVINO GenAI", + "GenAI" + ], + "scripts": { + "test_setup": "node ./tests/setup.js", + "test": "npm run test_setup && node --test ./tests/*.test.js" + }, + "devDependencies": { + "node-fetch": "^3.3.2", + "global-agent": "^3.0.0", + "@huggingface/hub": "^0.21.0" + } +} diff --git a/src/js/src/addon.cpp b/src/js/src/addon.cpp new file mode 100644 index 0000000000..4bd1da7bb6 --- /dev/null +++ b/src/js/src/addon.cpp @@ -0,0 +1,30 @@ +#include +#include + +#include "include/addon.hpp" + +#include "include/llm_pipeline/llm_pipeline_wrapper.hpp" + +void init_class(Napi::Env env, + Napi::Object exports, + std::string class_name, + Prototype func, + Napi::FunctionReference& reference) { + const auto& prototype = func(env); + 
+ reference = Napi::Persistent(prototype); + exports.Set(class_name, prototype); +} + +// Define the addon initialization function +Napi::Object init_module(Napi::Env env, Napi::Object exports) { + auto addon_data = new AddonData(); + env.SetInstanceData(addon_data); + + init_class(env, exports, "LLMPipeline", &LLMPipelineWrapper::get_class, addon_data->core); + + return exports; +} + +// Register the addon with Node.js +NODE_API_MODULE(genai-node, init_module) diff --git a/src/js/src/helper.cpp b/src/js/src/helper.cpp new file mode 100644 index 0000000000..106994603b --- /dev/null +++ b/src/js/src/helper.cpp @@ -0,0 +1,53 @@ +#include "include/helper.hpp" + +ov::AnyMap to_anyMap(const Napi::Env& env, const Napi::Value& val) { + ov::AnyMap properties; + if (!val.IsObject()) { + OPENVINO_THROW("Passed Napi::Value must be an object."); + } + const auto& parameters = val.ToObject(); + const auto& keys = parameters.GetPropertyNames(); + + for (uint32_t i = 0; i < keys.Length(); ++i) { + const auto& property_name = static_cast(keys[i]).ToString().Utf8Value(); + + const auto& any_value = js_to_cpp(env, parameters.Get(property_name)); + + properties.insert(std::make_pair(property_name, any_value)); + } + + return properties; +} + +template <> +ov::Any js_to_cpp(const Napi::Env& env, const Napi::Value& value) { + if (value.IsString()) { + return ov::Any(value.ToString().Utf8Value()); + } else if (value.IsBigInt()) { + Napi::BigInt big_value = value.As(); + bool is_lossless; + int64_t big_num = big_value.Int64Value(&is_lossless); + + if (!is_lossless) { + OPENVINO_THROW("Result of BigInt conversion to int64_t results in a loss of precision"); + } + + return ov::Any(big_num); + } else if (value.IsNumber()) { + Napi::Number num = value.ToNumber(); + + if (is_napi_value_int(env, value)) { + return ov::Any(num.Int32Value()); + } else { + return ov::Any(num.DoubleValue()); + } + } else if (value.IsBoolean()) { + return ov::Any(value.ToBoolean()); + } else { + OPENVINO_THROW("Cannot convert to ov::Any"); + } +} + +bool is_napi_value_int(const Napi::Env& env, const Napi::Value& num) { + return env.Global().Get("Number").ToObject().Get("isInteger").As().Call({num}).ToBoolean().Value(); +} diff --git a/src/js/src/llm_pipeline/finish_chat_worker.cpp b/src/js/src/llm_pipeline/finish_chat_worker.cpp new file mode 100644 index 0000000000..b07284688c --- /dev/null +++ b/src/js/src/llm_pipeline/finish_chat_worker.cpp @@ -0,0 +1,14 @@ +#include "include/llm_pipeline/finish_chat_worker.hpp" +#include +#include + +FinishChatWorker::FinishChatWorker(Function& callback, std::shared_ptr& pipe) + : AsyncWorker(callback), pipe(pipe) {}; + +void FinishChatWorker::Execute() { + this->pipe->finish_chat(); +}; + +void FinishChatWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/src/llm_pipeline/init_worker.cpp b/src/js/src/llm_pipeline/init_worker.cpp new file mode 100644 index 0000000000..87dd1aaf34 --- /dev/null +++ b/src/js/src/llm_pipeline/init_worker.cpp @@ -0,0 +1,18 @@ +#include "include/llm_pipeline/init_worker.hpp" +#include +#include + +InitWorker::InitWorker( + Function& callback, + std::shared_ptr& pipe, + const std::string model_path, + const std::string device +) : AsyncWorker(callback), pipe(pipe), model_path(model_path), device(device) {}; + +void InitWorker::Execute() { + this->pipe = std::make_shared(this->model_path, this->device); +}; + +void InitWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp 
b/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp new file mode 100644 index 0000000000..47bc9b352b --- /dev/null +++ b/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp @@ -0,0 +1,153 @@ +#include "include/helper.hpp" + +#include "include/llm_pipeline/llm_pipeline_wrapper.hpp" +#include "include/llm_pipeline/start_chat_worker.hpp" +#include "include/llm_pipeline/finish_chat_worker.hpp" +#include "include/llm_pipeline/init_worker.hpp" + +struct TsfnContext { + TsfnContext(std::string prompt) : prompt(prompt) {}; + ~TsfnContext() {}; + + std::thread native_thread; + Napi::ThreadSafeFunction tsfn; + + std::string prompt; + std::shared_ptr pipe = nullptr; + std::shared_ptr options = nullptr; +}; + +void performInferenceThread(TsfnContext* context) { + try { + ov::genai::GenerationConfig config; + config.update_generation_config(*context->options); + + std::function streamer = [context](std::string word) { + napi_status status = context->tsfn.BlockingCall([word](Napi::Env env, Napi::Function jsCallback) { + try { + jsCallback.Call({ + Napi::Boolean::New(env, false), + Napi::String::New(env, word) + }); + } catch(std::exception& err) { + Napi::Error::Fatal("performInferenceThread callback error. Details:" , err.what()); + } + }); + if (status != napi_ok) { + // Handle error + Napi::Error::Fatal("performInferenceThread error", "napi_status != napi_ok"); + } + + // Return flag corresponds whether generation should be stopped. + // false means continue generation. + return false; + }; + + context->pipe->generate(context->prompt, config, streamer); + napi_status status = context->tsfn.BlockingCall([](Napi::Env env, Napi::Function jsCallback) { + jsCallback.Call({ + Napi::Boolean::New(env, true), + }); + }); + + if (status != napi_ok) { + // Handle error + Napi::Error::Fatal("performInferenceThread error", "napi_status != napi_ok"); + } + + context->tsfn.Release(); + } + catch(std::exception& e) { + Napi::Error::Fatal("performInferenceThread error" , e.what()); + + context->tsfn.Release(); + } +} + +LLMPipelineWrapper::LLMPipelineWrapper(const Napi::CallbackInfo& info) : Napi::ObjectWrap(info) {}; + +Napi::Function LLMPipelineWrapper::get_class(Napi::Env env) { + return DefineClass(env, + "LLMPipeline", + {InstanceMethod("init", &LLMPipelineWrapper::init), + InstanceMethod("generate", &LLMPipelineWrapper::generate), + InstanceMethod("startChat", &LLMPipelineWrapper::start_chat), + InstanceMethod("finishChat", &LLMPipelineWrapper::finish_chat)}); +} + +Napi::Value LLMPipelineWrapper::init(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + const std::string model_path = info[0].ToString(); + const std::string device = info[1].ToString(); + Napi::Function callback = info[2].As(); + + InitWorker* asyncWorker = new InitWorker(callback, this->pipe, model_path, device); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} + +Napi::Value LLMPipelineWrapper::generate(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + TsfnContext* context = nullptr; + + try { + std::string prompt = info[0].ToString(); + ov::AnyMap options; + if (info.Length() == 3) { + options = to_anyMap(info.Env(), info[2]); + } + + context = new TsfnContext(prompt); + context->pipe = this->pipe; + context->options = std::make_shared(options); + // Create a ThreadSafeFunction + context->tsfn = Napi::ThreadSafeFunction::New( + env, + info[1].As(), // JavaScript function called asynchronously + "TSFN", // Name + 0, // Unlimited queue + 1, // Only one thread will use this initially + [context](Napi::Env) 
{ // Finalizer used to clean threads up + // std::cout << "Finalize TFSN" << std::endl; + context->native_thread.join(); + delete context; + } + ); + context->native_thread = std::thread(performInferenceThread, context); + + return Napi::Boolean::New(env, false); + } catch(Napi::TypeError& type_err) { + throw type_err; + } catch(std::exception& err) { + std::cout << "Catch in the thread: '" << err.what() << "'" << std::endl; + if (context != nullptr) { + context->tsfn.Release(); + } + + throw Napi::Error::New(env, err.what()); + } + + return Napi::Boolean::New(env, true); +} + +Napi::Value LLMPipelineWrapper::start_chat(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + Napi::Function callback = info[0].As(); + + StartChatWorker* asyncWorker = new StartChatWorker(callback, this->pipe); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} + +Napi::Value LLMPipelineWrapper::finish_chat(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + + Napi::Function callback = info[0].As(); + + FinishChatWorker* asyncWorker = new FinishChatWorker(callback, this->pipe); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} diff --git a/src/js/src/llm_pipeline/start_chat_worker.cpp b/src/js/src/llm_pipeline/start_chat_worker.cpp new file mode 100644 index 0000000000..302c505105 --- /dev/null +++ b/src/js/src/llm_pipeline/start_chat_worker.cpp @@ -0,0 +1,14 @@ +#include "include/llm_pipeline/start_chat_worker.hpp" +#include +#include + +StartChatWorker::StartChatWorker(Function& callback, std::shared_ptr& pipe) + : AsyncWorker(callback), pipe(pipe) {}; + +void StartChatWorker::Execute() { + this->pipe->start_chat(); +}; + +void StartChatWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/tests/bindings.test.js b/src/js/tests/bindings.test.js new file mode 100644 index 0000000000..72ca7f02fc --- /dev/null +++ b/src/js/tests/bindings.test.js @@ -0,0 +1,58 @@ +import addon from '../lib/bindings.cjs'; + +import assert from 'node:assert'; +import { describe, it, before, after } from 'node:test'; +import { models } from './models.js'; + +const MODEL_PATH = process.env.MODEL_PATH + || `./tests/models/${models[0].split('/')[1]}`; + +describe('bindings', () => { + let pipeline = null; + + before((_, done) => { + pipeline = new addon.LLMPipeline(); + + pipeline.init(MODEL_PATH, 'CPU', (err) => { + if (err) { + console.error(err); + process.exit(1); + } + + pipeline.startChat((err) => { + if (err) { + console.error(err); + process.exit(1); + } + + done(); + }); + }); + }); + + after((_, done) => { + pipeline.finishChat((err) => { + if (err) { + console.error(err); + process.exit(1); + } + + done(); + }); + }); + + it('should generate string result', (_, done) => { + let output = ''; + + pipeline.generate('Continue: 1 2 3', (isDone, chunk) => { + if (!isDone) { + output += chunk; + + return; + } + + assert.ok(output.length > 0); + done(); + }, { temperature: '0', max_new_tokens: '4' }); + }); +}); diff --git a/src/js/tests/models.js b/src/js/tests/models.js new file mode 100644 index 0000000000..03c689e038 --- /dev/null +++ b/src/js/tests/models.js @@ -0,0 +1,3 @@ +export const models = [ + 'OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov', +]; diff --git a/src/js/tests/module.test.js b/src/js/tests/module.test.js new file mode 100644 index 0000000000..0825625d3f --- /dev/null +++ b/src/js/tests/module.test.js @@ -0,0 +1,142 @@ +import { Pipeline } from '../lib/module.js'; + +import assert from 'node:assert/strict'; +import { describe, it, 
before, after } from 'node:test'; +import { models } from './models.js'; + +const MODEL_PATH = process.env.MODEL_PATH + || `./tests/models/${models[0].split('/')[1]}`; + +describe('module', async () => { + let pipeline = null; + + await before(async () => { + pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + }); + + await after(async () => { + await pipeline.finishChat(); + }); + + await it('should generate non empty string', async () => { + const result = await pipeline.generate( + 'Type something in English', + { temperature: '0', max_new_tokens: '4' }, + () => {}, + ); + + assert.ok(result.length > 0); + }); +}); + +describe('corner cases', async () => { + it('should throw an error if pipeline is already initialized', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await assert.rejects( + async () => await pipeline.init(), + { + name: 'Error', + message: 'Pipeline is already initialized', + }, + ); + }); + + it('should throw an error if chat is already started', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + + await assert.rejects( + () => pipeline.startChat(), + { + name: 'Error', + message: 'Chat is already started', + }, + ); + }); + + it('should throw an error if chat is not started', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await assert.rejects( + () => pipeline.finishChat(), + { + name: 'Error', + message: 'Chat is not started', + }, + ); + }); +}); + +describe('generation parameters validation', () => { + let pipeline = null; + + before(async () => { + pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + }); + + after(async () => { + await pipeline.finishChat(); + }); + + it('should throw an error if temperature is not a number', async () => { + await assert.rejects( + async () => await pipeline.generate(), + { + name: 'Error', + message: 'Prompt must be a string', + }, + ); + }); + + it('should throw an error if generationCallback is not a function', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + + await assert.rejects( + async () => await pipeline.generate('prompt', {}, false), + { + name: 'Error', + message: 'Generation callback must be a function', + }, + ); + }); + + it('should throw an error if options specified but not an object', async () => { + await assert.rejects( + async () => await pipeline.generate('prompt', 'options', () => {}), + { + name: 'Error', + message: 'Options must be an object', + }, + ); + }); + + it('should perform generation with default options', async () => { + try { + await pipeline.generate('prompt', { max_new_tokens: 1 }); + } catch (error) { + assert.fail(error); + } + + assert.ok(true); + }); + + it('should return a string as generation result', async () => { + const reply = await pipeline.generate('prompt', { max_new_tokens: 1 }); + + assert.strictEqual(typeof reply, 'string'); + }); + + it('should call generationCallback with string chunk', async () => { + await pipeline.generate('prompt', { max_new_tokens: 1 }, (chunk) => { + assert.strictEqual(typeof chunk, 'string'); + }); + }); +}); diff --git a/src/js/tests/setup.js b/src/js/tests/setup.js new file mode 100644 index 0000000000..3b52651719 --- /dev/null +++ b/src/js/tests/setup.js @@ -0,0 +1,6 @@ +import { dowloadModel } from './utils.js'; +import { models } from './models.js'; + +for (const model of 
models) { + await dowloadModel(model); +} diff --git a/src/js/tests/utils.js b/src/js/tests/utils.js new file mode 100644 index 0000000000..504782d8d1 --- /dev/null +++ b/src/js/tests/utils.js @@ -0,0 +1,47 @@ +import { bootstrap } from 'global-agent'; +import { promises as fs } from 'node:fs'; +import { listFiles, downloadFile } from '@huggingface/hub'; + +const BASE_DIR = './tests/models/'; + +bootstrap(); + +export async function dowloadModel(repo) { + console.log(`Downloading model '${repo}'`); + + const fetch = await import('node-fetch'); + const modelName = repo.split('/')[1]; + const destDir = `${BASE_DIR}${modelName}`; + + await fs.mkdir(destDir, { recursive: true }); + + const fileList = await listFiles({ + repo, + fetch: fetch.default, + }); + const fileNames = []; + for await (const file of fileList) { + fileNames.push(file.path); + } + + for (const path of fileNames) { + console.log(`Downloading file '${path}'`); + const response = await downloadFile({ + repo, + path, + fetch: fetch.default, + }); + const filename = `${destDir}/${path}`; + + await saveFile(filename, response); + console.log(`File '${path}' downloaded`); + } + + console.log(`Model '${repo}' downloaded`); +} + +async function saveFile(file, response) { + const arrayBuffer = await response.arrayBuffer(); + + await fs.writeFile(file, Buffer.from(arrayBuffer)); +} diff --git a/src/js/thirdparty/node-lib.def b/src/js/thirdparty/node-lib.def new file mode 100644 index 0000000000..8d46bbec84 --- /dev/null +++ b/src/js/thirdparty/node-lib.def @@ -0,0 +1,147 @@ +NAME NODE.EXE +EXPORTS +napi_async_destroy +napi_async_init +napi_cancel_async_work +napi_create_async_work +napi_create_buffer +napi_create_buffer_copy +napi_create_external_buffer +napi_delete_async_work +napi_fatal_error +napi_get_buffer_info +napi_get_node_version +napi_is_buffer +napi_make_callback +napi_module_register +napi_queue_async_work +napi_adjust_external_memory +napi_call_function +napi_close_escapable_handle_scope +napi_close_handle_scope +napi_coerce_to_bool +napi_coerce_to_number +napi_coerce_to_object +napi_coerce_to_string +napi_create_array +napi_create_array_with_length +napi_create_arraybuffer +napi_create_dataview +napi_create_double +napi_create_error +napi_create_external +napi_create_external_arraybuffer +napi_create_function +napi_create_int32 +napi_create_int64 +napi_create_object +napi_create_promise +napi_create_range_error +napi_create_reference +napi_create_string_latin1 +napi_create_string_utf16 +napi_create_string_utf8 +napi_create_symbol +napi_create_type_error +napi_create_typedarray +napi_create_uint32 +napi_define_class +napi_define_properties +napi_delete_element +napi_delete_property +napi_delete_reference +napi_escape_handle +napi_get_and_clear_last_exception +napi_get_array_length +napi_get_arraybuffer_info +napi_get_boolean +napi_get_cb_info +napi_get_dataview_info +napi_get_element +napi_get_global +napi_get_last_error_info +napi_get_named_property +napi_get_new_target +napi_get_null +napi_get_property +napi_get_property_names +napi_get_prototype +napi_get_reference_value +napi_get_typedarray_info +napi_get_undefined +napi_get_value_bool +napi_get_value_double +napi_get_value_external +napi_get_value_int32 +napi_get_value_int64 +napi_get_value_string_latin1 +napi_get_value_string_utf16 +napi_get_value_string_utf8 +napi_get_value_uint32 +napi_get_version +napi_has_element +napi_has_named_property +napi_has_own_property +napi_has_property +napi_instanceof +napi_is_array +napi_is_arraybuffer +napi_is_dataview 
+napi_is_error +napi_is_exception_pending +napi_is_promise +napi_is_typedarray +napi_new_instance +napi_open_escapable_handle_scope +napi_open_handle_scope +napi_reference_ref +napi_reference_unref +napi_reject_deferred +napi_remove_wrap +napi_resolve_deferred +napi_run_script +napi_set_element +napi_set_named_property +napi_set_property +napi_strict_equals +napi_throw +napi_throw_error +napi_throw_range_error +napi_throw_type_error +napi_typeof +napi_unwrap +napi_wrap +napi_get_uv_event_loop +napi_add_env_cleanup_hook +napi_close_callback_scope +napi_fatal_exception +napi_open_callback_scope +napi_remove_env_cleanup_hook +napi_acquire_threadsafe_function +napi_call_threadsafe_function +napi_create_threadsafe_function +napi_get_threadsafe_function_context +napi_ref_threadsafe_function +napi_release_threadsafe_function +napi_unref_threadsafe_function +napi_add_finalizer +napi_create_date +napi_get_date_value +napi_is_date +napi_create_bigint_int64 +napi_create_bigint_uint64 +napi_create_bigint_words +napi_get_all_property_names +napi_get_instance_data +napi_get_value_bigint_int64 +napi_get_value_bigint_uint64 +napi_get_value_bigint_words +napi_set_instance_data +napi_detach_arraybuffer +napi_is_detached_arraybuffer +napi_add_async_cleanup_hook +napi_remove_async_cleanup_hook +napi_check_object_type_tag +napi_object_freeze +napi_object_seal +napi_type_tag_object diff --git a/src/js/thirdparty/win_delay_load_hook.cc b/src/js/thirdparty/win_delay_load_hook.cc new file mode 100644 index 0000000000..9e652fa4df --- /dev/null +++ b/src/js/thirdparty/win_delay_load_hook.cc @@ -0,0 +1,52 @@ +/* + * When this file is linked to a DLL, it sets up a delay-load hook that + * intervenes when the DLL is trying to load 'node.exe' or 'iojs.exe' + * dynamically. Instead of trying to locate the .exe file it'll just return + * a handle to the process image. + * + * This allows compiled addons to work when node.exe or iojs.exe is renamed. + */ + +#ifdef _MSC_VER + +#ifndef WIN32_LEAN_AND_MEAN +#define WIN32_LEAN_AND_MEAN +#endif + +#include + +#include +#include + +static HMODULE node_dll = NULL; +static HMODULE nw_dll = NULL; + +static FARPROC WINAPI load_exe_hook(unsigned int event, DelayLoadInfo* info) { + if (event == dliNotePreGetProcAddress) { + FARPROC ret = NULL; + ret = GetProcAddress(node_dll, info->dlp.szProcName); + if (ret) + return ret; + ret = GetProcAddress(nw_dll, info->dlp.szProcName); + return ret; + } + if (event == dliStartProcessing) { + node_dll = GetModuleHandleA("node.dll"); + nw_dll = GetModuleHandleA("nw.dll"); + return NULL; + } + if (event != dliNotePreLoadLibrary) + return NULL; + + if (_stricmp(info->szDll, "node.exe") != 0) + return NULL; + + // Fall back to the current process + if(!node_dll) node_dll = GetModuleHandleA(NULL); + + return (FARPROC) node_dll; +} + +decltype(__pfnDliNotifyHook2) __pfnDliNotifyHook2 = load_exe_hook; + +#endif From 5cbadd1603c4019a046bbf46b0dd87feab1e7cbd Mon Sep 17 00:00:00 2001 From: Ilya Lavrenov Date: Wed, 29 Jan 2025 12:42:41 +0400 Subject: [PATCH 03/15] CB: preparation for relying on KV cache precisions from plugins (#1634) - Currently we have logic to detect KV cache precision and this logic become more and more complex - The idea is to rely on plugin's logic and compiled PA model with `ov::element::dynamic` precisions for KV cache inputs. - Later, take `ov::CompiledModel` and extract precisions from its `inputs()` - Then create tensors based on computed `num_kv_blocks` which depends on KV cache precisions. 
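For illustration, a condensed sketch of that extraction step (simplified from the new `CacheManager` constructor added in this patch; `KVCacheInfo` and `read_kv_cache_info` are illustrative names, not symbols introduced by the change, and the per-layer head/block geometry is assumed to be already known from `DeviceConfig`):

```cpp
#include <openvino/openvino.hpp>

#include <string>
#include <vector>

// Per-pipeline KV cache metadata recovered from an already compiled PA model.
struct KVCacheInfo {
    std::vector<ov::element::Type> key_precisions, value_precisions;
    size_t block_size_in_bytes = 0;  // size of one KV block summed over all cache inputs
};

// num_kv_heads, block_size and head_size are assumed to come from DeviceConfig;
// only the precisions are taken from the plugin here.
KVCacheInfo read_kv_cache_info(const ov::CompiledModel& compiled_model,
                               size_t num_kv_heads, size_t block_size, size_t head_size) {
    KVCacheInfo info;
    for (const auto& input : compiled_model.inputs()) {
        for (const auto& name : input.get_names()) {
            // The plugin has already materialized a concrete precision for the
            // ov::element::dynamic cache inputs the model was compiled with.
            const ov::element::Type precision = input.get_element_type();
            if (name.find("key_cache.") == 0) {
                info.key_precisions.push_back(precision);
            } else if (name.find("value_cache.") == 0) {
                info.value_precisions.push_back(precision);
            } else {
                continue;  // not a cache input name, check the next name
            }
            info.block_size_in_bytes += num_kv_heads * block_size * head_size * precision.size();
            break;  // one matching name per cache input is enough
        }
    }
    return info;
}

// The scheduler config can then be normalized from GB to blocks, e.g.:
//   num_kv_blocks = cache_size_gb * 1024 * 1024 * 1024 / info.block_size_in_bytes;
```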
Currently, logic to mimic plugin's logic for KV cache precisions is still here, but will be dropped once plugin will support `ov::element::dynamic` --- .github/labeler.yml | 4 +- src/cpp/src/cache_manager.hpp | 169 ++++++++++++------ src/cpp/src/continuous_batching_impl.cpp | 127 +++++++++++-- src/cpp/src/continuous_batching_impl.hpp | 4 +- src/cpp/src/device_config.hpp | 115 +----------- .../paged_attention_transformations.cpp | 24 ++- .../paged_attention_transformations.hpp | 0 src/cpp/src/scheduler.hpp | 30 ++-- ...batching_for_speculative_decoding_impl.cpp | 3 +- ...batching_for_speculative_decoding_impl.hpp | 3 +- .../speculative_decoding_impl.cpp | 11 +- tests/cpp/CMakeLists.txt | 6 +- tests/cpp/cache_manager.cpp | 91 ++++------ tests/cpp/device_config.cpp | 33 ---- tests/cpp/helper.cpp | 27 +++ tests/cpp/helper.hpp | 8 + tests/cpp/scheduler.cpp | 34 +--- tests/cpp/speculative_decoding.cpp | 3 +- 18 files changed, 352 insertions(+), 340 deletions(-) rename src/cpp/src/{utils => }/paged_attention_transformations.cpp (80%) rename src/cpp/src/{utils => }/paged_attention_transformations.hpp (100%) delete mode 100644 tests/cpp/device_config.cpp create mode 100644 tests/cpp/helper.cpp create mode 100644 tests/cpp/helper.hpp diff --git a/.github/labeler.yml b/.github/labeler.yml index 2bfe4248c1..a75abd795c 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -103,8 +103,8 @@ - 'src/cpp/src/generation_handle.cpp' - 'src/cpp/src/generation_stream.hpp' - 'src/cpp/src/model_runner.hpp' -- 'src/cpp/src/utils/paged_attention_transformations.cpp' -- 'src/cpp/src/utils/paged_attention_transformations.hpp' +- 'src/cpp/src/paged_attention_transformations.cpp' +- 'src/cpp/src/paged_attention_transformations.hpp' - 'src/cpp/src/scheduler.hpp' - 'src/cpp/src/sequence_group.cpp' - 'src/cpp/src/sequence_group.hpp' diff --git a/src/cpp/src/cache_manager.hpp b/src/cpp/src/cache_manager.hpp index 5a0ff9b9f3..255bb926be 100644 --- a/src/cpp/src/cache_manager.hpp +++ b/src/cpp/src/cache_manager.hpp @@ -45,19 +45,19 @@ class TensorMmapAllocator { #endif namespace ov::genai { + class CacheManager { - DeviceConfig m_device_config; - std::vector m_key_cache; - std::vector m_value_cache; - size_t m_num_allocated_kv_blocks = 0; + size_t m_num_decoder_layers = 0; + std::string m_device; + std::vector m_key_precisions, m_value_precisions; + std::vector m_key_shapes, m_value_shapes; + std::vector m_key_cache, m_value_cache; + size_t m_num_allocated_kv_blocks = 0, m_block_size_in_bytes = 0; ov::InferRequest m_request; - ov::Core m_core; - ov::Shape set_first_dim_and_make_static(const ov::PartialShape& shape, size_t dim) { - ov::PartialShape res_shape = shape; - res_shape[0] = dim; - OPENVINO_ASSERT(res_shape.is_static()); - return res_shape.to_shape(); + static ov::Shape set_kv_blocks(ov::PartialShape pshape, size_t num_kv_blocks) { + pshape[0] = num_kv_blocks; + return pshape.get_shape(); } void update_request_tensor(size_t decoder_layer_id) { @@ -65,41 +65,106 @@ class CacheManager { m_request.set_tensor(std::string("value_cache.") + std::to_string(decoder_layer_id), m_value_cache[decoder_layer_id]); } + ov::PartialShape patch_shape(ov::PartialShape pshape, ov::element::Type cache_type) { + OPENVINO_ASSERT(!m_device.empty(), "Internal error: device is not set"); + + if (m_device.find("CPU") != std::string::npos && cache_type == ov::element::u8) { + // Scale, zero point and quantized data will be stored together. 
+ // The layout for per token per head: + // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| + // so, we have to extend head_size by 8, which is sizeof(float) + // for scale and sizeof(float) for zeropoint + pshape[3] += 2 * sizeof(float); + } + + return pshape; + } + public: - explicit CacheManager(const DeviceConfig &device_config, ov::InferRequest request, ov::Core core) : - m_device_config(device_config), - m_request(request), - m_core(core) { - m_key_cache.reserve(m_device_config.get_num_layers()); - m_value_cache.reserve(m_device_config.get_num_layers()); + CacheManager(ov::InferRequest request, const DeviceConfig& device_config) : + m_request(request) { + // extract information about inference device + ov::CompiledModel compiled_model = request.get_compiled_model(); + std::vector execution_devices = compiled_model.get_property(ov::execution_devices); + OPENVINO_ASSERT(execution_devices.size() == 1, "Contituous batching: execution device is expected to be CPU or GPU, but got ", execution_devices.size(), " devices"); + m_device = execution_devices[0]; + + // extract information about KV cache precisions and shapes + size_t kv_input_index = 0; + for (const auto& input : compiled_model.inputs()) { + for (auto & name : input.get_names()) { + auto cache_precision = input.get_element_type(); + + if (name.find("key_cache.") == 0) { + auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision); + m_key_shapes.push_back(pshape); + m_key_precisions.push_back(cache_precision); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); + break; + } else if (name.find("value_cache.") == 0) { + auto pshape = patch_shape(device_config.get_value_cache_shape(kv_input_index), cache_precision); + m_value_shapes.push_back(pshape); + m_value_precisions.push_back(cache_precision); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); + ++kv_input_index; + break; + } + } + } + + m_num_decoder_layers = m_value_precisions.size(); + OPENVINO_ASSERT(m_num_decoder_layers == m_key_precisions.size(), "Invalid case: a different number of K and V caches in a LLM model"); + } + + size_t get_num_decoder_layers() const { + return m_num_decoder_layers; + } + + std::string get_device() const { + return m_device; + } + + ov::element::Type get_key_cache_precision(size_t decoder_layer_id) const { + OPENVINO_ASSERT(decoder_layer_id < m_key_precisions.size()); + return m_key_precisions[decoder_layer_id]; + } + + ov::element::Type get_value_cache_precision(size_t decoder_layer_id) const { + OPENVINO_ASSERT(decoder_layer_id < m_value_precisions.size()); + return m_value_precisions[decoder_layer_id]; + } + + size_t get_block_size_in_bytes() const { + return m_block_size_in_bytes; } void allocate_cache_if_needed(size_t num_kv_blocks) { if (m_num_allocated_kv_blocks >= num_kv_blocks) { return; } - OPENVINO_ASSERT(m_key_cache.size() == m_value_cache.size()); - m_num_allocated_kv_blocks = num_kv_blocks; - const std::string device_name = m_device_config.get_device(); + m_num_allocated_kv_blocks = num_kv_blocks; ov::Coordinate start_key{0,0,0,0}; ov::Coordinate start_value{0,0,0,0}; - if (device_name.find("GPU") == std::string::npos) {// Allocate KV caches - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape value_cache_shape = 
set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Shape key_cache_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), num_kv_blocks); + if (m_device.find("GPU") == std::string::npos) {// Allocate KV caches + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape value_cache_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], num_kv_blocks); + ov::Shape key_cache_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], num_kv_blocks); + + ov::element::Type key_precision = get_key_cache_precision(decoder_layer_id); + ov::element::Type value_precision = get_value_cache_precision(decoder_layer_id); + #ifdef _WIN32 - ov::Tensor key_cache(m_device_config.get_cache_precision(), key_cache_shape); - ov::Tensor value_cache(m_device_config.get_cache_precision(), value_cache_shape); + ov::Tensor key_cache(key_precision, key_cache_shape); + ov::Tensor value_cache(value_precision, value_cache_shape); #else - auto key_size = ov::shape_size(key_cache_shape) * m_device_config.get_cache_precision().size(); - auto value_size = ov::shape_size(value_cache_shape) * m_device_config.get_cache_precision().size(); - - ov::Tensor key_cache = ov::Tensor(m_device_config.get_cache_precision(), key_cache_shape, TensorMmapAllocator(key_size)); - ov::Tensor value_cache = ov::Tensor(m_device_config.get_cache_precision(), value_cache_shape, TensorMmapAllocator(value_size)); + auto key_size = ov::shape_size(key_cache_shape) * key_precision.size(); + auto value_size = ov::shape_size(value_cache_shape) * value_precision.size(); + ov::Tensor key_cache(key_precision, key_cache_shape, TensorMmapAllocator(key_size)); + ov::Tensor value_cache(value_precision, value_cache_shape, TensorMmapAllocator(value_size)); #endif auto key_cache_roi_end = static_cast(key_cache.data()); @@ -137,8 +202,7 @@ class CacheManager { if (m_key_cache.size() > decoder_layer_id) { m_key_cache[decoder_layer_id] = key_cache; m_value_cache[decoder_layer_id] = value_cache; - } - else { + } else { m_key_cache.emplace_back(key_cache); m_value_cache.emplace_back(value_cache); } @@ -146,15 +210,15 @@ class CacheManager { update_request_tensor(decoder_layer_id); } } else { - auto remote_context = m_core.get_default_context(device_name); - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape value_cache_shape = set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Shape key_cache_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Tensor key_cache = remote_context.create_tensor(m_device_config.get_cache_precision(), - key_cache_shape); - ov::Tensor value_cache = remote_context.create_tensor(m_device_config.get_cache_precision(), - value_cache_shape); - + auto remote_context = m_request.get_compiled_model().get_context(); + + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape value_cache_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], num_kv_blocks); + ov::Shape key_cache_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], num_kv_blocks); + + ov::Tensor key_cache = remote_context.create_tensor(get_key_cache_precision(decoder_layer_id), key_cache_shape); + ov::Tensor value_cache = remote_context.create_tensor(get_value_cache_precision(decoder_layer_id), 
value_cache_shape); + if (m_key_cache.size() > decoder_layer_id) { ov::Coordinate end_key = m_key_cache[decoder_layer_id].get_shape(); ov::Coordinate end_value = m_value_cache[decoder_layer_id].get_shape(); @@ -167,23 +231,23 @@ class CacheManager { m_key_cache[decoder_layer_id] = key_cache; m_value_cache[decoder_layer_id] = value_cache; - } - else { + } else { m_key_cache.emplace_back(key_cache); m_value_cache.emplace_back(value_cache); } + update_request_tensor(decoder_layer_id); } } } ov::Tensor get_key_cache(size_t decoder_layer_id) const { - OPENVINO_ASSERT(decoder_layer_id < m_key_cache.size()); + OPENVINO_ASSERT(decoder_layer_id < m_key_cache.size(), "decoder_layer_id = ", decoder_layer_id, ", num_layers = ", m_key_cache.size()); return m_key_cache[decoder_layer_id]; } ov::Tensor get_value_cache(size_t decoder_layer_id) const { - OPENVINO_ASSERT(decoder_layer_id < m_value_cache.size()); + OPENVINO_ASSERT(decoder_layer_id < m_value_cache.size(), "decoder_layer_id = ", decoder_layer_id, ", num_layers = ", m_value_cache.size()); return m_value_cache[decoder_layer_id]; } @@ -192,9 +256,9 @@ class CacheManager { size_t src_block_id = blocks_pair.first; const std::list& dst_block_ids = blocks_pair.second; for (size_t dst_block_id : dst_block_ids) { - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape key_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), m_num_allocated_kv_blocks); - ov::Shape value_shape = set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), m_num_allocated_kv_blocks); + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape key_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], m_num_allocated_kv_blocks); + ov::Shape value_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], m_num_allocated_kv_blocks); ov::Coordinate key_src_start_roi(key_shape.size(), 0); ov::Coordinate key_src_end_roi = key_shape; ov::Coordinate key_dst_start_roi(key_shape.size(), 0); @@ -221,13 +285,6 @@ class CacheManager { } } } - - std::shared_ptr get_core() { - return std::make_shared(m_core); - } - - std::shared_ptr get_device_config() { - return std::make_shared(m_device_config); - } }; + } diff --git a/src/cpp/src/continuous_batching_impl.cpp b/src/cpp/src/continuous_batching_impl.cpp index 99df043090..b4100f8aec 100644 --- a/src/cpp/src/continuous_batching_impl.cpp +++ b/src/cpp/src/continuous_batching_impl.cpp @@ -7,9 +7,95 @@ #include "text_callback_streamer.hpp" #include "continuous_batching_impl.hpp" #include "utils.hpp" -#include "utils/paged_attention_transformations.hpp" +#include "paged_attention_transformations.hpp" #include "lora_helper.hpp" #include "cache_state_dumper.hpp" +#include "utils.hpp" + +namespace { + +ov::element::Type get_model_kv_cache_precision(std::shared_ptr model) { + const std::vector kv_cache_precision_path = { "runtime_options", ov::hint::kv_cache_precision.name() }; + ov::element::Type ir_kv_cache_precision = ov::element::undefined; + + if (model->has_rt_info(kv_cache_precision_path)) { + ir_kv_cache_precision = model->get_rt_info(kv_cache_precision_path); + } + + return ir_kv_cache_precision; +} + +void apply_kv_cache_precision(const std::shared_ptr& model, const std::string& device, const ov::AnyMap& plugin_config) { + ov::element::Type m_kv_cache_type = ov::element::undefined, ir_kv_cache_precision = get_model_kv_cache_precision(model); + ov::Core core = 
ov::genai::utils::singleton_core(); + + auto inference_precision = core.get_property(device, ov::hint::inference_precision); + // if user sets properties affecting KV cache precision + const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); + const auto kv_cache_precision_it = plugin_config.find(ov::hint::kv_cache_precision.name()); + const auto execution_mode_it = plugin_config.find(ov::hint::execution_mode.name()); + const bool accuracy_mode = execution_mode_it != plugin_config.end() && + execution_mode_it->second.as() == ov::hint::ExecutionMode::ACCURACY; + + if (device == "CPU") { + if (kv_cache_precision_it != plugin_config.end()) { + const auto kv_cache_precision = kv_cache_precision_it->second.as(); + m_kv_cache_type = kv_cache_precision; + } else if (accuracy_mode) { + // ACCURACY mode will use f32 KV cache type + m_kv_cache_type = ov::element::f32; + } else if (ir_kv_cache_precision != ov::element::undefined) { + // check that kv_cache_precision is set in runtime_info section of OpenVINO IR + // but in case it's set to FP16, we need to patch it to be BF16 for Xeon platforms + m_kv_cache_type = ir_kv_cache_precision == ov::element::f16 && inference_precision == ov::element::bf16 ? + inference_precision : ir_kv_cache_precision; + } else { + // x86 and ARM have different default kv cache type, take this information from the plugin + m_kv_cache_type = core.get_property(device, ov::hint::kv_cache_precision); + } + + // TEMP WA: currently FP16 / BF16 KV cache is faster than U8 for PagedAttention + if (m_kv_cache_type == ov::element::u8) { + m_kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; + } + } else if (device.find("GPU") != std::string::npos) { + if (accuracy_mode) { + inference_precision = ov::element::f32; + } + if (inference_precision_it != plugin_config.end()) { + inference_precision = inference_precision_it->second.as(); + } + + m_kv_cache_type = inference_precision; + } else { + OPENVINO_THROW(device, " is not supported by OpenVINO Continuous Batching"); + } + + std::map> key_cache_params, value_cache_params; + for (const auto& param_ptr : model->get_parameters()) { + const auto& name = param_ptr->get_friendly_name(); + if (name.find("key_cache.") == 0) { + key_cache_params[name] = param_ptr; + } else if (name.find("value_cache.") == 0) { + value_cache_params[name] = param_ptr; + } + } + + OPENVINO_ASSERT(key_cache_params.size() == value_cache_params.size() && key_cache_params.size() > 0); + + size_t num_decoder_layers = key_cache_params.size(); + for (size_t idx = 0; idx < num_decoder_layers; idx++) { + auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; + auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; + + k->set_element_type(m_kv_cache_type); + v->set_element_type(m_kv_cache_type); + } + + model->validate_nodes_and_infer_types(); +} + +} // namespace namespace ov::genai { template struct overloaded : Ts... 
{using Ts::operator()...;}; @@ -27,15 +113,14 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::ContinuousBatchingImpl( m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - ov::Core core = utils::singleton_core(); - DeviceConfig device_config(core, scheduler_config, device, properties); + DeviceConfig device_config(device); bool is_need_per_layer_cache_control = scheduler_config.use_cache_eviction; bool allow_cache_rotation = scheduler_config.cache_eviction_config.apply_rotation; utils::apply_paged_attention_transformations(model, device_config, is_need_per_layer_cache_control, allow_cache_rotation); utils::apply_gather_before_matmul_transformation(model); - initialize_pipeline(model, scheduler_config, properties, device_config, core); + initialize_pipeline(model, scheduler_config, properties, device_config); } ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() { @@ -55,10 +140,13 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( std::shared_ptr model, const SchedulerConfig& scheduler_config, const ov::AnyMap& properties, - const DeviceConfig& device_config, - ov::Core& core) { + const DeviceConfig& device_config) { + ov::Core core = utils::singleton_core(); ov::CompiledModel compiled_model; + // TODO: remove once plugin automatically set KV cache precisions + apply_kv_cache_precision(model, device_config.get_device(), properties); + // apply LoRA if (auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters)) { m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model."); @@ -71,24 +159,27 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( ov::genai::utils::print_compiled_model_properties(compiled_model, "LLM with Paged Attention"); ov::InferRequest infer_request = compiled_model.create_infer_request(); - m_num_decoder_layers = device_config.get_num_layers(); - - // setup KV caches - std::shared_ptr cache_manager = std::make_shared(device_config, infer_request, core); + // Cache manager + std::shared_ptr cache_manager = std::make_shared(infer_request, device_config); + m_num_decoder_layers = cache_manager->get_num_decoder_layers(); - SchedulerConfig updated_config = scheduler_config; - // update KV blocks number in scheduler config - if (scheduler_config.num_kv_blocks != device_config.get_num_kv_blocks()) { - updated_config.num_kv_blocks = device_config.get_num_kv_blocks(); + // Scheduler + SchedulerConfig normalized_config = scheduler_config; + if (normalized_config.num_kv_blocks == 0 && normalized_config.cache_size > 0) { + size_t size_in_bytes = normalized_config.cache_size * 1024 * 1024 * 1024; // convert GBs to bytes + normalized_config.num_kv_blocks = size_in_bytes / cache_manager->get_block_size_in_bytes(); } bool can_use_partial_preemption = true; - if (device_config.get_device().find("GPU") != std::string::npos && !updated_config.dynamic_split_fuse) { + if (device_config.get_device().find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { // in case of executing a `vLLM-like` pipeline, it's better not to use partial eviction on the GPU, // as it may lead to performance slowdown can_use_partial_preemption = false; } - m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, updated_config, device_config.get_num_layers(), can_use_partial_preemption); + + m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, 
normalized_config, m_num_decoder_layers, can_use_partial_preemption); + + // Model Runner bool is_use_cache_eviction = m_scheduler->get_config().use_cache_eviction; if (is_use_cache_eviction) { const auto& eviction_config = m_scheduler->get_config().cache_eviction_config; @@ -101,14 +192,14 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( /* is_use_rotation_inputs = */ is_apply_rotation); if (eviction_config.apply_rotation) { m_rotation_deltas_stores.reserve(m_num_decoder_layers); - ov::Shape rotation_deltas_store_shape{scheduler_config.num_kv_blocks, 1}; // last dim can be later changed to BLOCK_SIZE for per-token granularity + ov::Shape rotation_deltas_store_shape{normalized_config.num_kv_blocks, 1}; // last dim can be later changed to BLOCK_SIZE for per-token granularity for (size_t i = 0; i < m_num_decoder_layers; i++) { ov::Tensor store(ov::element::i32, rotation_deltas_store_shape); std::memset(store.data(), 0, store.get_byte_size()); m_rotation_deltas_stores.push_back(store); } - size_t max_sequence_cache_occupation_length_in_blocks = scheduler_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; + size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; size_t embedding_size = device_config.get_k_head_size(0); m_cache_rotation_calculator = std::make_shared( m_scheduler->get_block_size(), diff --git a/src/cpp/src/continuous_batching_impl.hpp b/src/cpp/src/continuous_batching_impl.hpp index f64657bc7a..9fa6c9c660 100644 --- a/src/cpp/src/continuous_batching_impl.hpp +++ b/src/cpp/src/continuous_batching_impl.hpp @@ -59,9 +59,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc void initialize_pipeline(std::shared_ptr model, const SchedulerConfig& scheduler_config, const ov::AnyMap& plugin_config, - const DeviceConfig& device_config, - ov::Core& core); - + const DeviceConfig& device_config); /** * Pulls requests from awaiting queue to running queue diff --git a/src/cpp/src/device_config.hpp b/src/cpp/src/device_config.hpp index 3d41960c5e..09020da9a8 100644 --- a/src/cpp/src/device_config.hpp +++ b/src/cpp/src/device_config.hpp @@ -20,11 +20,8 @@ struct KVHeadConfig { }; class DeviceConfig { - ov::element::Type m_kv_cache_type; std::vector m_key_cache_shape, m_value_cache_shape; std::vector m_kv_heads_config; - size_t m_num_decoder_layers = 0; - size_t m_num_kv_blocks = 0, m_cache_size = 0; // KV cache sizes in either blocks or GBs size_t m_block_size = 0; // block size is per inference device std::string m_device; @@ -35,90 +32,17 @@ class DeviceConfig { } public: - DeviceConfig(ov::Core& core, const SchedulerConfig& scheduling_config, const std::string& device, const ov::AnyMap& plugin_config = {}) { + explicit DeviceConfig(const std::string& device) { m_device = device; - - // keep information about blocsk m_block_size = get_block_size_by_device(device); - - if (m_device == "CPU") { - auto inference_precision = core.get_property(device, ov::hint::inference_precision); - m_kv_cache_type = inference_precision == ov::element::bf16 ? 
ov::element::bf16 : ov::element::f16; - - // if user sets precision hint, kv cache type should be changed - const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); - if (inference_precision_it != plugin_config.end()) { - const auto inference_precision = inference_precision_it->second.as(); - if (inference_precision == ov::element::f32) { - m_kv_cache_type = ov::element::f32; - } else if (inference_precision == ov::element::f16) { - m_kv_cache_type = ov::element::f16; - } else if (inference_precision == ov::element::bf16) { - m_kv_cache_type = ov::element::bf16; - } else { - // use default f32 - m_kv_cache_type = ov::element::f32; - } - } - - // if user sets ov::kv_cache_precision hint - const auto kv_cache_precision_it = plugin_config.find(ov::hint::kv_cache_precision.name()); - if (kv_cache_precision_it != plugin_config.end()) { - const auto kv_cache_precision = kv_cache_precision_it->second.as(); - m_kv_cache_type = kv_cache_precision; - } - } else if (m_device.find("GPU") != std::string::npos) { - auto inference_precision = core.get_property(device, ov::hint::inference_precision); - m_kv_cache_type = inference_precision == ov::element::f16 ? ov::element::f16 : ov::element::f32; - - // if user sets precision hint, kv cache type should be changed - const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); - if (inference_precision_it != plugin_config.end()) { - const auto inference_precision = inference_precision_it->second.as(); - if (inference_precision == ov::element::f16) { - m_kv_cache_type = ov::element::f16; - } else { - // use default f32 - m_kv_cache_type = ov::element::f32; - } - } - } else { - OPENVINO_THROW(m_device, " is not supported by OpenVINO Continuous Batching"); - } - - if (scheduling_config.num_kv_blocks > 0) { - m_num_kv_blocks = scheduling_config.num_kv_blocks; - } else if (scheduling_config.cache_size > 0) { - m_cache_size = scheduling_config.cache_size; - } } - void set_kv_head_configs(std::vector kv_heads_config) { + void set_kv_head_configs(const std::vector& kv_heads_config) { m_kv_heads_config = kv_heads_config; - m_num_decoder_layers = m_kv_heads_config.size(); - m_key_cache_shape.reserve(m_num_decoder_layers); - m_value_cache_shape.reserve(m_num_decoder_layers); - - if (m_device == "CPU") { - // Scale, zero point and quantized data will be stored together. 
- // The layout for per token per head: - // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| - // so, we have to extend head_size by 8, which is sizeof(float) - // for scale and sizeof(float) for zeropoint - if (m_kv_cache_type == ov::element::u8) { - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; ++layer_id) { - m_kv_heads_config[layer_id].k_head_size += 8; - m_kv_heads_config[layer_id].v_head_size += 8; - } - } - } + m_key_cache_shape.reserve(m_kv_heads_config.size()); + m_value_cache_shape.reserve(m_kv_heads_config.size()); - if (m_num_kv_blocks == 0 && m_cache_size > 0) { - size_t size_in_bytes = m_cache_size * 1024 * 1024 * 1024; // convert GBs to bytes - m_num_kv_blocks = size_in_bytes / get_block_size_in_bytes(); - } - - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; layer_id++) { + for (size_t layer_id = 0; layer_id < kv_heads_config.size(); layer_id++) { const KVHeadConfig& config = m_kv_heads_config[layer_id]; m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), @@ -126,7 +50,7 @@ class DeviceConfig { ov::Dimension(m_block_size), ov::Dimension(config.v_head_size)}); - if (m_device.find("GPU") == std::string::npos) { + if (m_device.find("CPU") != std::string::npos) { m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), ov::Dimension(config.num_k_heads), ov::Dimension(m_block_size), @@ -145,44 +69,23 @@ class DeviceConfig { return m_device; } - ov::element::Type get_cache_precision() const { - return m_kv_cache_type; - } - - size_t get_num_layers() const { - return m_num_decoder_layers; - } - ov::PartialShape get_key_cache_shape(size_t id) const { OPENVINO_ASSERT(m_key_cache_shape.size()); return m_key_cache_shape[id]; } - size_t get_k_head_size(size_t layer_id) const { - return m_kv_heads_config[layer_id].k_head_size; - } - ov::PartialShape get_value_cache_shape(size_t id) const { OPENVINO_ASSERT(m_value_cache_shape.size()); return m_value_cache_shape[id]; } - size_t get_num_kv_blocks() const { - return m_num_kv_blocks; + size_t get_k_head_size(size_t layer_id) const { + return m_kv_heads_config[layer_id].k_head_size; } size_t get_block_size() const { return m_block_size; } - - size_t get_block_size_in_bytes() const { - size_t block_size_in_bytes = 0; - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; layer_id++) { - const KVHeadConfig& config = m_kv_heads_config[layer_id]; - block_size_in_bytes += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; - } - block_size_in_bytes *= get_block_size() * get_cache_precision().size(); - return block_size_in_bytes; - } }; + } diff --git a/src/cpp/src/utils/paged_attention_transformations.cpp b/src/cpp/src/paged_attention_transformations.cpp similarity index 80% rename from src/cpp/src/utils/paged_attention_transformations.cpp rename to src/cpp/src/paged_attention_transformations.cpp index 17a3fdddbe..6d337136dc 100644 --- a/src/cpp/src/utils/paged_attention_transformations.cpp +++ b/src/cpp/src/paged_attention_transformations.cpp @@ -1,7 +1,7 @@ // Copyright (C) 2023-2025 Intel Corporation // SPDX-License-Identifier: Apache-2.0 -#include "utils/paged_attention_transformations.hpp" +#include "paged_attention_transformations.hpp" #include "openvino/pass/manager.hpp" #include "openvino/pass/sdpa_to_paged_attention.hpp" @@ -10,7 +10,6 @@ namespace ov { namespace genai { namespace utils { - size_t get_hidden_size(const std::shared_ptr model) { const auto& parameters = 
model->get_parameters(); // extract num_kv_heads and head_size @@ -50,23 +49,32 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& for (size_t idx = 0; idx < num_decoder_layers; idx++) { KVHeadConfig& config = kv_heads_config[idx]; - auto key_shape = key_cache_params[std::string("key_cache.") + std::to_string(idx)]->get_partial_shape(); + auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; + auto key_shape = k->get_partial_shape(); config.num_k_heads = key_shape[1].get_length(); config.k_head_size = key_shape[2].get_length(); - auto value_shape = value_cache_params[std::string("value_cache.") + std::to_string(idx)]->get_partial_shape(); + auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; + auto value_shape = v->get_partial_shape(); config.num_v_heads = value_shape[1].get_length(); config.v_head_size = value_shape[2].get_length(); } + + // save information about KV caches in device_config + // and create device dependent KV cache shapes device_config.set_kv_head_configs(kv_heads_config); for (size_t idx = 0; idx < num_decoder_layers; idx++) { auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; - k->set_element_type(device_config.get_cache_precision()); - v->set_element_type(device_config.get_cache_precision()); - k->set_partial_shape(device_config.get_key_cache_shape(idx)); - v->set_partial_shape(device_config.get_value_cache_shape(idx)); + + // allow a plugin to automatically set KV cache precisions + k->set_element_type(ov::element::dynamic); + v->set_element_type(ov::element::dynamic); + + // set device specific KV cache shapes back to a PA model + k->set_partial_shape(ov::PartialShape::dynamic(4)); + v->set_partial_shape(ov::PartialShape::dynamic(4)); } model->validate_nodes_and_infer_types(); diff --git a/src/cpp/src/utils/paged_attention_transformations.hpp b/src/cpp/src/paged_attention_transformations.hpp similarity index 100% rename from src/cpp/src/utils/paged_attention_transformations.hpp rename to src/cpp/src/paged_attention_transformations.hpp diff --git a/src/cpp/src/scheduler.hpp b/src/cpp/src/scheduler.hpp index ba6fe44cff..23db68deab 100644 --- a/src/cpp/src/scheduler.hpp +++ b/src/cpp/src/scheduler.hpp @@ -14,6 +14,7 @@ #include "sequence_group.hpp" #include "cache_manager.hpp" #include "timer.hpp" +#include "utils.hpp" namespace ov::genai { class Scheduler { @@ -462,12 +463,12 @@ class Scheduler { } size_t _get_available_gpu_memory() { - auto device_config = m_cache_manager->get_device_config(); - auto core = m_cache_manager->get_core(); - auto device = device_config->get_device(); + auto device = m_cache_manager->get_device(); OPENVINO_ASSERT(device.find("GPU") != std::string::npos, "_get_available_gpu_memory() is applicable for GPU only."); - auto memory_statistics = core->get_property(device, ov::intel_gpu::memory_statistics); - auto device_type = core->get_property(device, ov::device::type); + + ov::Core core = utils::singleton_core(); + auto memory_statistics = core.get_property(device, ov::intel_gpu::memory_statistics); + auto device_type = core.get_property(device, ov::device::type); // sum up all used device memory std::vector device_memory_types = {"cl_mem", "usm_device"}; @@ -487,7 +488,7 @@ class Scheduler { used_device_mem *= used_memory_threshold; // total device memory in bytes - auto total_device_memory = core->get_property(device, ov::intel_gpu::device_total_mem_size); + auto 
total_device_memory = core.get_property(device, ov::intel_gpu::device_total_mem_size); return total_device_memory - used_device_mem; } @@ -514,32 +515,29 @@ class Scheduler { if (!m_dynamic_memory_allocation) { return false; } - auto device_config = m_cache_manager->get_device_config(); - auto device = device_config->get_device(); + auto device = m_cache_manager->get_device(); size_t current_num_of_kv_blocks = m_block_manager->get_total_number_of_kv_blocks(); size_t new_blocks_num = current_num_of_kv_blocks * m_cache_growth_factor; if (device.find("GPU") == std::string::npos) { m_block_manager->increase_kv_blocks_number(new_blocks_num); - } - else { - size_t available_gpu_memory = _get_available_gpu_memory(); - size_t required_memory = (new_blocks_num - current_num_of_kv_blocks) * device_config->get_block_size_in_bytes(); + } else { + const size_t available_gpu_memory = _get_available_gpu_memory(); + const size_t block_size_in_bytes = m_cache_manager->get_block_size_in_bytes(); + size_t required_memory = (new_blocks_num - current_num_of_kv_blocks) * block_size_in_bytes; if (required_memory <= available_gpu_memory) { m_block_manager->increase_kv_blocks_number(new_blocks_num); } else { - size_t possible_blocks_to_add = available_gpu_memory / device_config->get_block_size_in_bytes(); + size_t possible_blocks_to_add = available_gpu_memory / block_size_in_bytes; if (possible_blocks_to_add > 0) { m_block_manager->increase_kv_blocks_number(current_num_of_kv_blocks + possible_blocks_to_add); - } - else { + } else { return false; } } } return true; } - }; } diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp index bec2b75e0d..2ecdbd66f3 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp @@ -5,7 +5,6 @@ namespace ov::genai { ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::ContinuousBatchingForSpeculativeDecodingImpl( - ov::Core& core, const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, @@ -17,7 +16,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin m_tokenizer = tokenizer; m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - initialize_pipeline(model, scheduler_config, plugin_config, device_config, core); + initialize_pipeline(model, scheduler_config, plugin_config, device_config); } void diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp index e4e4be63d8..b714316e75 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp @@ -13,8 +13,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl : public: ContinuousBatchingForSpeculativeDecodingImpl() = default; - ContinuousBatchingForSpeculativeDecodingImpl(ov::Core& core, - const std::shared_ptr& model, + ContinuousBatchingForSpeculativeDecodingImpl(const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, const DeviceConfig& device_config, diff --git 
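For the GPU path in the scheduler change above, the growth decision reduces to simple arithmetic: the available memory is the device total minus the currently used `cl_mem`/`usm_device` allocations scaled by the usage threshold, and the block pool grows only as far as that budget allows. The following Python sketch only illustrates that decision with placeholder numbers; the block size now comes from the cache manager rather than `DeviceConfig`.

```python
# Illustrative sketch only, not library code: the dynamic KV-block growth policy
# from the scheduler changes above.
def grow_kv_blocks(current_blocks, available_gpu_bytes, block_size_in_bytes, growth_factor=2):
    new_blocks = current_blocks * growth_factor
    required = (new_blocks - current_blocks) * block_size_in_bytes
    if required <= available_gpu_bytes:
        return new_blocks                                   # enough room: grow by the full factor
    extra = available_gpu_bytes // block_size_in_bytes
    return current_blocks + extra if extra > 0 else None    # None: cannot grow further

print(grow_kv_blocks(100, 512 * 2**20, 1_179_648))  # -> 200 with these placeholder numbers
```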
a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp index ddb3d0ae10..32d13feed1 100644 --- a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp @@ -5,8 +5,8 @@ #include "text_callback_streamer.hpp" #include "speculative_decoding_impl.hpp" +#include "paged_attention_transformations.hpp" #include "utils.hpp" -#include "utils/paged_attention_transformations.hpp" namespace ov::genai { @@ -35,6 +35,7 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); + utils::apply_gather_before_matmul_transformation(main_model); utils::apply_gather_before_matmul_transformation(draft_model); @@ -63,9 +64,7 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con ov::AnyMap draft_properties = draft_model_desc.properties.empty() ? main_model_desc.properties : draft_model_desc.properties; - ov::Core core = utils::singleton_core(); - DeviceConfig main_device_config(core, main_scheduler_config_updated, main_device, main_model_desc.properties), - draft_device_config(core, draft_scheduler_config, draft_device, draft_properties); + DeviceConfig main_device_config(main_device), draft_device_config(draft_device); utils::set_kv_cache_type_and_shape(main_model, main_device_config); utils::set_kv_cache_type_and_shape(draft_model, draft_device_config); @@ -81,10 +80,10 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con m_tokenizer = main_model_tokenizer; // to create `main_pipeline` with enabled validation_mode and `draft_pipeline` with disabled validation mode - m_main_pipeline = std::make_shared(core, + m_main_pipeline = std::make_shared( main_model, main_model_tokenizer, main_model_desc.generation_config, main_device_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); - m_draft_pipeline = std::make_shared(core, + m_draft_pipeline = std::make_shared( draft_model, draft_model_tokenizer, draft_model_desc.generation_config, draft_device_config, draft_scheduler_config, draft_device, draft_properties, false); } diff --git a/tests/cpp/CMakeLists.txt b/tests/cpp/CMakeLists.txt index d63ae17dcf..29e481cec3 100644 --- a/tests/cpp/CMakeLists.txt +++ b/tests/cpp/CMakeLists.txt @@ -20,15 +20,15 @@ file(GLOB src_files "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/sequence_group.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/sampler.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/speculative_decoding/*.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/prompt_lookup/*.cpp" - "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/utils/*.cpp" + "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/paged_attention_transformations.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/utils.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/continuous_batching*.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/icontinuous_batching.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/lora_helper.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/text_callback_streamer.cpp") -add_executable(${TEST_TARGET_NAME} ${tests_src} - block_allocator.cpp) +add_executable(${TEST_TARGET_NAME} ${tests_src}) + target_link_libraries(${TEST_TARGET_NAME} PRIVATE openvino::genai gtest_main gmock_main) 
target_include_directories(${TEST_TARGET_NAME} PRIVATE "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src") target_sources(${TEST_TARGET_NAME} PRIVATE ${src_files}) diff --git a/tests/cpp/cache_manager.cpp b/tests/cpp/cache_manager.cpp index 0c483f0ec1..864a7b43af 100644 --- a/tests/cpp/cache_manager.cpp +++ b/tests/cpp/cache_manager.cpp @@ -7,37 +7,13 @@ #include "scheduler.hpp" #include "device_config.hpp" #include "cache_manager.hpp" -#include "openvino/op/concat.hpp" +#include "helper.hpp" using namespace ov::genai; -std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers) { - ov::NodeVector keys; - ov::NodeVector values; - ov::ParameterVector params; - ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision); - ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; - - auto shape = ov::PartialShape({ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic()}); - for (size_t i = 0; i < num_layers; i++) { - auto key = std::make_shared(kv_cache_type, shape); - auto value = std::make_shared(kv_cache_type, shape); - key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); - value->get_output_tensor(0).set_names({"value_cache." + std::to_string(i)}); - keys.push_back(key); - values.push_back(value); - params.push_back(key); - params.push_back(value); - } - const auto& concat1 = std::make_shared(keys, 1); - const auto& concat2 = std::make_shared(values, 1); - auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); - return std::make_shared(ov::NodeVector{concat1, concat2}, params); -} - -size_t get_total_allocated_bytes(std::shared_ptr cache_manager, size_t num_decoder_layers) { +size_t get_total_allocated_bytes(std::shared_ptr cache_manager) { size_t allocated_bytes = 0; - for (size_t i = 0; i < num_decoder_layers; i++) { + for (size_t i = 0; i < cache_manager->get_num_decoder_layers(); i++) { auto key_cache = cache_manager->get_key_cache(i); auto value_cache = cache_manager->get_value_cache(i); allocated_bytes += key_cache.get_byte_size() + value_cache.get_byte_size(); @@ -45,93 +21,98 @@ size_t get_total_allocated_bytes(std::shared_ptr cache_ return allocated_bytes; } +size_t get_num_kv_blocks(size_t cache_size, size_t block_size_bytes) { + size_t kv_cache_size_in_bytes = cache_size * 1024 * 1024 * 1024; // convert GBs to bytes + return kv_cache_size_in_bytes / block_size_bytes; +} TEST(TestCacheManager, test_cache_size_param) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 0; scheduler_config.cache_size = 2; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); + auto cache_manager = std::make_shared(request, device_config); + ASSERT_EQ(num_decoder_layers, 
cache_manager->get_num_decoder_layers()); + const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, cache_manager->get_block_size_in_bytes()); + + auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - - ASSERT_EQ(get_total_allocated_bytes(cache_manager, num_decoder_layers), 2146959360); + + const size_t kv_cache_total_size = scheduler_config.cache_size * 1024 * 1024 * 1024; + const size_t cpu_block_size_total = cache_manager->get_block_size_in_bytes(); + size_t expected_size = kv_cache_total_size / cpu_block_size_total * cpu_block_size_total; + ASSERT_EQ(get_total_allocated_bytes(cache_manager), expected_size); } TEST(TestCacheManager, test_kv_blocks_param) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 150; scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); - ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); + auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, device_config.get_block_size(), num_decoder_layers); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); } TEST(TestCacheManager, test_dynamic_cache_increase) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 0; scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); - size_t block_size_in_bytes = 0; - for (size_t layer_id = 0; layer_id < num_decoder_layers; layer_id++) { - KVHeadConfig config = kv_heads_config[layer_id]; - block_size_in_bytes += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; - } - block_size_in_bytes *= device_config.get_block_size() * device_config.get_cache_precision().size(); - ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); + auto cache_manager = std::make_shared(request, device_config); + size_t block_size_in_bytes = cache_manager->get_block_size_in_bytes(); + const size_t 
num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, block_size_in_bytes); + + auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); // check initial cache allocation block_manager.increase_kv_blocks_number(100); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), 100); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), 100); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 100 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 100 * block_size_in_bytes); // check cache increase block_manager.increase_kv_blocks_number(200); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), 200); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), 200); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 200 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); // check that cache does not increase if new blocks were not allocated cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 200 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); } \ No newline at end of file diff --git a/tests/cpp/device_config.cpp b/tests/cpp/device_config.cpp deleted file mode 100644 index a97037b1e8..0000000000 --- a/tests/cpp/device_config.cpp +++ /dev/null @@ -1,33 +0,0 @@ -// Copyright (C) 2018-2025 Intel Corporation -// SPDX-License-Identifier: Apache-2.0 -// - -#include -#include "openvino/runtime/core.hpp" -#include "scheduler.hpp" -#include "device_config.hpp" - -TEST(TestDeviceConfig, kv_cache_precision_u8) { - ov::Core core; - ov::genai::SchedulerConfig scheduler_config; - scheduler_config.max_num_batched_tokens = 32; - scheduler_config.num_kv_blocks = 0; - scheduler_config.cache_size = 2; - scheduler_config.max_num_seqs = 2; - - const std::string device = "CPU"; - size_t num_decoder_layers = 12; - size_t head_size = 64, head_size_u8 = head_size + 8; - - ov::genai::KVHeadConfig kv_head_config { 12, 12, head_size_u8, head_size_u8 }; - ov::genai::KVHeadConfig kv_head_config_u8 { 12, 12, head_size, head_size }; - - ov::genai::DeviceConfig device_config_default(core, scheduler_config, "CPU"); - ov::genai::DeviceConfig device_config_u8(core, scheduler_config, "CPU", { ov::hint::kv_cache_precision(ov::element::u8) }); - - device_config_default.set_kv_head_configs(std::vector(num_decoder_layers, kv_head_config)); - device_config_u8.set_kv_head_configs(std::vector(num_decoder_layers, kv_head_config_u8)); - - const auto ratio = ov::element::f16.size() / ov::element::u8.size(); - ASSERT_EQ(device_config_default.get_num_kv_blocks() * ratio, device_config_u8.get_num_kv_blocks()); -} diff --git a/tests/cpp/helper.cpp b/tests/cpp/helper.cpp new file mode 100644 index 0000000000..da242da479 --- /dev/null +++ b/tests/cpp/helper.cpp @@ -0,0 +1,27 @@ +// Copyright (C) 2023-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#include "helper.hpp" +#include "openvino/op/concat.hpp" + +std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers) { + 
ov::NodeVector keys, values; + ov::ParameterVector params; + ov::element::Type kv_cache_type = core.get_property("CPU", ov::hint::kv_cache_precision); + + auto shape = ov::PartialShape::dynamic(4); + for (size_t i = 0; i < num_layers; i++) { + auto key = std::make_shared(kv_cache_type, shape); + auto value = std::make_shared(kv_cache_type, shape); + key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); + value->get_output_tensor(0).set_names({"value_cache." + std::to_string(i)}); + keys.push_back(key); + values.push_back(value); + params.push_back(key); + params.push_back(value); + } + const auto& concat1 = std::make_shared(keys, 1); + const auto& concat2 = std::make_shared(values, 1); + auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); + return std::make_shared(ov::NodeVector{concat1, concat2}, params); +} diff --git a/tests/cpp/helper.hpp b/tests/cpp/helper.hpp new file mode 100644 index 0000000000..1fafe8bcf6 --- /dev/null +++ b/tests/cpp/helper.hpp @@ -0,0 +1,8 @@ +// Copyright (C) 2023-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#pragma once + +#include "openvino/runtime/core.hpp" + +std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers); \ No newline at end of file diff --git a/tests/cpp/scheduler.cpp b/tests/cpp/scheduler.cpp index 201318347a..b6aa5a9b53 100644 --- a/tests/cpp/scheduler.cpp +++ b/tests/cpp/scheduler.cpp @@ -9,6 +9,7 @@ #include "openvino/genai/generation_config.hpp" #include "sequence_group.hpp" #include "scheduler.hpp" +#include "helper.hpp" using namespace ov::genai; @@ -18,39 +19,16 @@ void clear_finished_sequences(std::vector& requests) { }); requests.erase(new_end, requests.end()); } -std::shared_ptr get_model(ov::Core core, size_t num_layers) { - ov::NodeVector keys; - ov::NodeVector values; - ov::ParameterVector params; - ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision); - ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; - - auto shape = ov::PartialShape({ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic()}); - for (size_t i = 0; i < num_layers; i++) { - auto key = std::make_shared(kv_cache_type, shape); - auto value = std::make_shared(kv_cache_type, shape); - key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); - value->get_output_tensor(0).set_names({"value_cache." 
+ std::to_string(i)}); - keys.push_back(key); - values.push_back(value); - params.push_back(key); - params.push_back(value); - } - const auto& concat1 = std::make_shared(keys, 1); - const auto& concat2 = std::make_shared(values, 1); - auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); - return std::make_shared(ov::NodeVector{concat1, concat2}, params); -} std::shared_ptr init_cache_manager(SchedulerConfig scheduler_config) { ov::Core core = ov::Core(); size_t num_decoder_layers = 12; - ov::InferRequest request = core.compile_model(get_model(core, num_decoder_layers)).create_infer_request(); - size_t head_size = 64, head_size_u8 = head_size + 8; - std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size_u8, head_size_u8 }); - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); + const size_t head_size = 64; + std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size, head_size }); + ov::genai::DeviceConfig device_config("CPU"); device_config.set_kv_head_configs(kv_head_configs); - return std::make_shared(device_config, request, core); + return std::make_shared(request, device_config); } TEST(TestScheduler, general_test) { diff --git a/tests/cpp/speculative_decoding.cpp b/tests/cpp/speculative_decoding.cpp index 1cf8db0fab..114f16800b 100644 --- a/tests/cpp/speculative_decoding.cpp +++ b/tests/cpp/speculative_decoding.cpp @@ -13,8 +13,7 @@ class CBForSDTest : public testing::Test, public ov::genai::ContinuousBatchingPi m_sampler = std::make_shared(); }; - ov::genai::GenerationHandle - add_request(uint64_t request_id, const ov::Tensor& input_ids) { + ov::genai::GenerationHandle add_request(uint64_t request_id, const ov::Tensor& input_ids) { auto sampling_params = ov::genai::greedy(); sampling_params.num_assistant_tokens = 1; From e866ec088bc0a89f307509160401390f816373d3 Mon Sep 17 00:00:00 2001 From: Ekaterina Aidova Date: Wed, 29 Jan 2025 17:10:19 +0400 Subject: [PATCH 04/15] [LLM bench]support providing adapter config mode (#1644) CVS-161355 --- tools/llm_bench/benchmark.py | 1 + .../llm_bench/llm_bench_utils/model_utils.py | 1 + tools/llm_bench/llm_bench_utils/ov_utils.py | 19 +++++++++++++++---- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/tools/llm_bench/benchmark.py b/tools/llm_bench/benchmark.py index d01c6316ff..3a4079d6b6 100644 --- a/tools/llm_bench/benchmark.py +++ b/tools/llm_bench/benchmark.py @@ -140,6 +140,7 @@ def get_argprser(): default=None, help="Path to LoRA adapters for using OpenVINO GenAI optimized pipelines with LoRA for benchmarking") parser.add_argument('--lora_alphas', nargs='*', help='Alphas params for LoRA adapters.', required=False, default=[]) + parser.add_argument("--lora_mode", choices=["auto", "fuse", "static", "static_rank", "dynamic"], help="LoRA adapters loading mode") parser.add_argument("--use_cb", action="store_true", help="Use Continuous Batching inference mode") parser.add_argument("--cb_config", required=False, default=None, help="Path to file with Continuous Batching Scheduler settings or dict") parser.add_argument("--draft_model", required=False, default=None, diff --git a/tools/llm_bench/llm_bench_utils/model_utils.py b/tools/llm_bench/llm_bench_utils/model_utils.py index aaf72113dc..4bd696a569 100644 --- a/tools/llm_bench/llm_bench_utils/model_utils.py +++ b/tools/llm_bench/llm_bench_utils/model_utils.py @@ -131,6 +131,7 @@ def 
analyze_args(args): model_args['output_dir'] = args.output_dir model_args['lora'] = args.lora model_args['lora_alphas'] = args.lora_alphas + model_args['lora_mode'] = args.lora_mode use_cb = args.use_cb or args.draft_model if args.device == "NPU" and use_cb: log.warning("Continious batching and Speculative Decoding are not supported for NPU device") diff --git a/tools/llm_bench/llm_bench_utils/ov_utils.py b/tools/llm_bench/llm_bench_utils/ov_utils.py index c70e4beb5e..eea3dd50f3 100644 --- a/tools/llm_bench/llm_bench_utils/ov_utils.py +++ b/tools/llm_bench/llm_bench_utils/ov_utils.py @@ -135,9 +135,20 @@ def decode_ov_tokenizer(self, token_ids, *args, **kwargs): return hf_tokenizer -def get_lora_config(lora_paths, lora_alphas): +def get_lora_config(lora_paths, lora_alphas, lora_mode=None): import openvino_genai + modes = { + "auto": openvino_genai.AdapterConfig.Mode.MODE_AUTO, + "fuse": openvino_genai.AdapterConfig.Mode.MODE_FUSE, + "dynamic": openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC, + "static": openvino_genai.AdapterConfig.Mode.MODE_STATIC, + "static_rank": openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC + } + if lora_mode is not None: + lora_mode = modes[lora_mode] + log.info(f"LoRA adapters loading mode: {lora_mode}") + adapter_config = list() if not lora_paths: return adapter_config @@ -150,7 +161,7 @@ def get_lora_config(lora_paths, lora_alphas): if not Path(lora_paths[idx]).exists(): log.warning(f'LoRA path is not exists: {lora_paths[idx]}. LoRA will be ignored.') continue - adapter_config = openvino_genai.AdapterConfig() + adapter_config = openvino_genai.AdapterConfig() if lora_mode is None else openvino_genai.AdapterConfig(mode=lora_mode) adapter = openvino_genai.Adapter(lora_paths[idx]) alpha = float(lora_alphas[idx]) adapter_config.add(adapter, alpha) @@ -263,7 +274,7 @@ def create_genai_text_gen_model(model_path, device, ov_config, **kwargs): if kwargs.get("draft_cb_config") is not None else {} ov_config['draft_model'] = openvino_genai.draft_model(draft_model_path, draft_device.upper(), **draft_model_load_kwargs) - adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", [])) + adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", []), kwargs.get("lora_mode", None)) if adapter_config: ov_config['adapters'] = adapter_config @@ -413,7 +424,7 @@ def get_unet_step_count(self): def get_vae_decoder_step_count(self): return 1 - adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", [])) + adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", []), kwargs.get("lora_mode", None)) if adapter_config: ov_config['adapters'] = adapter_config From 106e56126be652e18998762d05eafb2aa681315d Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Wed, 29 Jan 2025 17:41:29 +0400 Subject: [PATCH 05/15] [WWB]: Fixed chat template usage in VLM GenAI pipeline (#1643) Co-authored-by: Ilya Lavrenov --- tools/who_what_benchmark/whowhatbench/wwb.py | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 7d4354f846..1eb778a060 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -337,11 +337,9 @@ def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_to config.max_new_tokens = max_new_tokens config.do_sample = False model.set_generation_config(config) - if tokenizer.chat_template is not None: 
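A minimal sketch of what the new `--lora_mode` plumbing above amounts to at the `openvino_genai` level, mirroring the `get_lora_config()` change; the adapter path and alpha value are placeholders.

```python
# Minimal sketch mirroring the get_lora_config() change above; path and alpha are placeholders.
import openvino_genai

mode = openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC         # selected via --lora_mode dynamic
adapter_config = openvino_genai.AdapterConfig(mode=mode)
adapter_config.add(openvino_genai.Adapter("adapter.safetensors"), 0.8)

ov_config = {"adapters": adapter_config}   # later forwarded to the GenAI pipeline constructor
```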
- model.start_chat(tokenizer.chat_template) - else: - model.start_chat() - out = model.generate(prompt, images=[image_data]) + + model.start_chat() + out = model.generate(prompt, image=image_data) model.finish_chat() return out.texts[0] From 020bdabcb6a9fab889f97c43cf59122a9fdf9c88 Mon Sep 17 00:00:00 2001 From: Sofya Balandina Date: Wed, 29 Jan 2025 13:53:56 +0000 Subject: [PATCH 06/15] Automatically apply chat template in non-chat scenarios (#1533) [CVS-157276](https://jira.devtools.intel.com/browse/CVS-157276) --- .github/workflows/causal_lm_cpp.yml | 48 ++++++++++++++----- README.md | 1 - samples/cpp/text_generation/README.md | 2 +- samples/python/text_generation/README.md | 2 +- src/README.md | 2 + .../openvino/genai/generation_config.hpp | 7 +++ .../include/openvino/genai/llm_pipeline.hpp | 4 ++ src/cpp/include/openvino/genai/tokenizer.hpp | 3 ++ .../genai/visual_language/pipeline.hpp | 8 ++++ .../genai/whisper_generation_config.hpp | 2 +- src/cpp/src/generation_config.cpp | 1 + src/cpp/src/icontinuous_batching.cpp | 16 ++++++- src/cpp/src/llm_pipeline_stateful.cpp | 26 ++++++++-- src/cpp/src/llm_pipeline_static.cpp | 20 +++++++- src/cpp/src/tokenizer.cpp | 8 ++++ .../src/visual_language/inputs_embedder.cpp | 29 ++++++++++- .../src/visual_language/inputs_embedder.hpp | 3 ++ src/cpp/src/visual_language/pipeline.cpp | 2 + src/cpp/src/whisper_generation_config.cpp | 6 +++ .../openvino_genai/py_openvino_genai.pyi | 5 ++ src/python/py_generation_config.cpp | 2 + src/python/py_tokenizer.cpp | 6 +++ tests/python_tests/common.py | 18 +++++-- tests/python_tests/test_generation_config.py | 2 + tests/python_tests/test_llm_pipeline.py | 8 ++-- tests/python_tests/test_sampling.py | 4 +- tools/llm_bench/task/text_generation.py | 2 + .../task/visual_language_generation.py | 1 + tools/who_what_benchmark/whowhatbench/wwb.py | 3 +- 29 files changed, 207 insertions(+), 34 deletions(-) diff --git a/.github/workflows/causal_lm_cpp.yml b/.github/workflows/causal_lm_cpp.yml index 2e0afaa882..5dff0a58d3 100644 --- a/.github/workflows/causal_lm_cpp.yml +++ b/.github/workflows/causal_lm_cpp.yml @@ -120,7 +120,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('Why is the Sun yellow?', return_tensors='pt') + prompt = 'Why is the Sun yellow?' 
+ if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -136,7 +139,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('69', return_tensors='pt') + prompt = '69' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -152,7 +158,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('Hi', return_tensors='pt') + prompt = 'Hi' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -168,7 +177,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('return 0', return_tensors='pt') + prompt = 'return 0' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -184,7 +196,10 @@ jobs: with open('pred.txt', 'r', errors='ignore') as file: predictions = file.read() tokenizer = 
transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('你好! 你好嗎?', return_tensors='pt') + prompt = '你好! 你好嗎?' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref.replace('�', '')) @@ -194,19 +209,21 @@ jobs: " echo "你好! 你好嗎?" passed - timeout 1m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Alan Turing was a" "return 0" "你好! 你好嗎?" > ./pred.txt + timeout 1m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Why is the Sun yellow?" "return 0" "你好! 你好嗎?" > ./pred.txt python -c " import transformers with open('pred.txt', 'r', errors='ignore') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') prompts = [ - 'Alan Turing was a', + 'Why is the Sun yellow?', 'return 0', '你好! 你好嗎?' ] for prompt in prompts: - tokenized = tokenizer(prompt, return_tensors='pt') + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref.replace('�', '')) @@ -255,7 +272,10 @@ jobs: echo import transformers > ref.py echo predictions = open('cpp.txt', 'r').read() >> ref.py echo tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', trust_remote_code=True) >> ref.py - echo tokenized = tokenizer('69', return_tensors='pt') >> ref.py + echo prompt = '69' >> ref.py + echo if tokenizer.chat_template: >> ref.py + echo prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) >> ref.py + echo tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) >> ref.py echo for beam in transformers.AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', trust_remote_code=True).generate(**tokenized, max_new_tokens=100, do_sample=False): >> ref.py echo ref = tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) >> ref.py echo idx = predictions.find(ref) >> ref.py @@ -562,7 +582,10 @@ jobs: with open('pred_greedy.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/phi-1_5') - tokenized = tokenizer('Alan Turing was a', return_tensors='pt') + prompt = 'Alan Turing was a' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, 
add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for output in transformers.AutoModelForCausalLM.from_pretrained('microsoft/phi-1_5').generate(**tokenized, max_length=100, do_sample=False): ref = tokenizer.decode(output[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -617,7 +640,10 @@ jobs: with open('pred_greedy.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('ikala/redpajama-3b-chat') - tokenized = tokenizer('Alan Turing was a', return_tensors='pt') + prompt = 'Alan Turing was a' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for output in transformers.AutoModelForCausalLM.from_pretrained('ikala/redpajama-3b-chat').generate(**tokenized, max_length=100, do_sample=False): ref = tokenizer.decode(output[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) diff --git a/README.md b/README.md index cea1e358bc..221a81c6c3 100644 --- a/README.md +++ b/README.md @@ -133,7 +133,6 @@ from PIL import Image # Choose GPU instead of CPU in the line below to run the model on Intel integrated or discrete GPU pipe = openvino_genai.VLMPipeline("./InternVL2-1B", "CPU") -pipe.start_chat() image = Image.open("dog.jpg") image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8) diff --git a/samples/cpp/text_generation/README.md b/samples/cpp/text_generation/README.md index dd24b6ebf5..d20d8ac09d 100644 --- a/samples/cpp/text_generation/README.md +++ b/samples/cpp/text_generation/README.md @@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat ./chat_sample ``` #### Missing chat template -If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model. +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model or update it using call `pipe.get_tokenizer().set_chat_template(new_chat_template)`. The following template can be used as a default, but it may not work properly with every model: ``` "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", diff --git a/samples/python/text_generation/README.md b/samples/python/text_generation/README.md index 97a6ad59bc..6b086f3471 100644 --- a/samples/python/text_generation/README.md +++ b/samples/python/text_generation/README.md @@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat python chat_sample.py model_dir ``` #### Missing chat template -If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. 
To work this around, manually add the chat template to tokenizer_config.json of your model. +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model or update it using call `pipe.get_tokenizer().set_chat_template(new_chat_template)`. The following template can be used as a default, but it may not work properly with every model: ``` "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", diff --git a/src/README.md b/src/README.md index af4953f98a..c2ed8c2a60 100644 --- a/src/README.md +++ b/src/README.md @@ -73,6 +73,8 @@ output: 'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in' ``` +>**Note**: The chat_template from tokenizer_config.json or from tokenizer/detokenizer model will be automatically applied to the prompt at the generation stage. If you want to disable it, you can do it by calling pipe.get_tokenizer().set_chat_template(""). + A simple chat in Python: ```python import openvino_genai as ov_genai diff --git a/src/cpp/include/openvino/genai/generation_config.hpp b/src/cpp/include/openvino/genai/generation_config.hpp index 3a75fc02ea..13cc8f0b01 100644 --- a/src/cpp/include/openvino/genai/generation_config.hpp +++ b/src/cpp/include/openvino/genai/generation_config.hpp @@ -77,6 +77,8 @@ enum class StopCriteria { EARLY, HEURISTIC, NEVER }; * @param assistant_confidence_threshold the lower token probability of candidate to be validated by main model in case of dynamic strategy candidates number update. * @param num_assistant_tokens the defined candidates number to be generated by draft model/prompt lookup in case of static strategy candidates number update. * @param max_ngram_size is maximum ngram to use when looking for matches in the prompt. + * + * @param apply_chat_template whether or not to apply chat_template for non-chat scenarios */ class OPENVINO_GENAI_EXPORTS GenerationConfig { @@ -128,6 +130,9 @@ class OPENVINO_GENAI_EXPORTS GenerationConfig { std::optional adapters; + // set to true if chat template should be applied for non-chat scenarios, set to false otherwise + bool apply_chat_template = true; + /** @brief sets eos_token_id to tokenizer_eos_token_id if eos_token_id is less than 0. * Otherwise verifies eos_token_id == tokenizer_eos_token_id. */ @@ -189,6 +194,8 @@ extern OPENVINO_GENAI_EXPORTS ov::Property rng_seed; static constexpr ov::Property assistant_confidence_threshold{"assistant_confidence_threshold"}; static constexpr ov::Property num_assistant_tokens{"num_assistant_tokens"}; +static constexpr ov::Property apply_chat_template{"apply_chat_template"}; + // Predefined Configs OPENVINO_DEPRECATED("Please, use individual parameters instead of predefined configs. 
This method will be removed in 2026.0.0 release") diff --git a/src/cpp/include/openvino/genai/llm_pipeline.hpp b/src/cpp/include/openvino/genai/llm_pipeline.hpp index 31b1ac1675..26232574dc 100644 --- a/src/cpp/include/openvino/genai/llm_pipeline.hpp +++ b/src/cpp/include/openvino/genai/llm_pipeline.hpp @@ -177,6 +177,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline { * @param generation_config optional GenerationConfig * @param streamer optional streamer * @return DecodedResults decoded resulting text + * chat_template will be applied to the prompt, run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it. + * To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. */ DecodedResults generate( StringInputs inputs, @@ -191,6 +193,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline { * @param inputs input prompt or a vector of prompts * @param properties properties * @return DecodedResults decoded resulting text + * chat_template will be applied to the prompt, run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it. + * To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. */ template util::EnableIfAllStringAny generate( diff --git a/src/cpp/include/openvino/genai/tokenizer.hpp b/src/cpp/include/openvino/genai/tokenizer.hpp index 0a54d1da2a..bde4eb3fe1 100644 --- a/src/cpp/include/openvino/genai/tokenizer.hpp +++ b/src/cpp/include/openvino/genai/tokenizer.hpp @@ -221,6 +221,9 @@ class OPENVINO_GENAI_EXPORTS Tokenizer { /// @param chat_template The new template to override with. void set_chat_template(const std::string& chat_template); + // get information about a chat template to check its status, for example whether it is empty + std::string get_chat_template() const; + // information about , tokens should be public, // they are used at least in StreamerBase descendants int64_t get_bos_token_id() const; diff --git a/src/cpp/include/openvino/genai/visual_language/pipeline.hpp b/src/cpp/include/openvino/genai/visual_language/pipeline.hpp index 8c3d380b0f..b6b1d5c7f6 100644 --- a/src/cpp/include/openvino/genai/visual_language/pipeline.hpp +++ b/src/cpp/include/openvino/genai/visual_language/pipeline.hpp @@ -98,6 +98,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param generation_config A config to follow for text generation. /// @param streamer A streamer to acquire intermediate result. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. VLMDecodedResults generate( const std::string& prompt, const std::vector& rgbs, @@ -111,6 +113,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param generation_config A config to follow for text generation. /// @param streamer A streamer to acquire intermediate result. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. 
VLMDecodedResults generate( const std::string& prompt, const ov::Tensor& rgb, @@ -124,6 +128,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// for its members, StreamerVariant a single image or multiple /// images. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. VLMDecodedResults generate( const std::string& prompt, const ov::AnyMap& config_map @@ -137,6 +143,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param ...properties ov::Property instances to be combined into /// ov::AnyMap. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. template util::EnableIfAllStringAny generate( const std::string& prompt, diff --git a/src/cpp/include/openvino/genai/whisper_generation_config.hpp b/src/cpp/include/openvino/genai/whisper_generation_config.hpp index 18b4202609..db92f2bcc4 100644 --- a/src/cpp/include/openvino/genai/whisper_generation_config.hpp +++ b/src/cpp/include/openvino/genai/whisper_generation_config.hpp @@ -18,7 +18,7 @@ namespace genai { */ class OPENVINO_GENAI_EXPORTS WhisperGenerationConfig : public GenerationConfig { public: - WhisperGenerationConfig() = default; + WhisperGenerationConfig(); explicit WhisperGenerationConfig(const std::filesystem::path& json_path); // Corresponds to the ”<|startoftranscript|>” token. diff --git a/src/cpp/src/generation_config.cpp b/src/cpp/src/generation_config.cpp index de23852c9b..3914e217c4 100644 --- a/src/cpp/src/generation_config.cpp +++ b/src/cpp/src/generation_config.cpp @@ -128,6 +128,7 @@ void GenerationConfig::update_generation_config(const ov::AnyMap& properties) { read_anymap_param(properties, "logprobs", logprobs); read_anymap_param(properties, "num_return_sequences", num_return_sequences); read_anymap_param(properties, "adapters", adapters); + read_anymap_param(properties, "apply_chat_template", apply_chat_template); // penalties read_anymap_param(properties, "frequency_penalty", frequency_penalty); diff --git a/src/cpp/src/icontinuous_batching.cpp b/src/cpp/src/icontinuous_batching.cpp index 78f8fda8f7..5bdf00d51d 100644 --- a/src/cpp/src/icontinuous_batching.cpp +++ b/src/cpp/src/icontinuous_batching.cpp @@ -53,9 +53,21 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate( } else { input_ids.reserve(prompts.size()); timer.start(); - for (const std::string& prompt : prompts) { + for (size_t i = 0; i < prompts.size(); i++) { + const std::string& prompt = prompts.at(i); const auto encode_start = std::chrono::steady_clock::now(); - input_ids.push_back(m_tokenizer.encode(prompt).input_ids); + ov::Tensor encoded_inputs; + if (sampling_params.at(i).apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + encoded_inputs = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)).input_ids; + } else { + // in case when chat_template was not found in tokenizer_config.json or set + std::string input_str(prompt); + 
encoded_inputs = m_tokenizer.encode(input_str, ov::genai::add_special_tokens(true)).input_ids; + } + input_ids.push_back(encoded_inputs); tokenization_durations.emplace_back(PerfMetrics::get_microsec(std::chrono::steady_clock::now() - encode_start)); } timer.end(); diff --git a/src/cpp/src/llm_pipeline_stateful.cpp b/src/cpp/src/llm_pipeline_stateful.cpp index 2a53154c27..0dea53c7ed 100644 --- a/src/cpp/src/llm_pipeline_stateful.cpp +++ b/src/cpp/src/llm_pipeline_stateful.cpp @@ -88,7 +88,18 @@ DecodedResults StatefulLLMPipeline::generate( if (auto input_vector = std::get_if>(&inputs)) { OPENVINO_ASSERT(!is_chat_conversation, "Can't chat with multiple prompts"); - encoded_input = m_tokenizer.encode(*input_vector); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + std::vector templated_input_vector; + for (auto& input : *input_vector) { + ChatHistory history({{{"role", "user"}, {"content", input}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + templated_input_vector.push_back(templated_prompt); + } + encoded_input = m_tokenizer.encode(templated_input_vector, ov::genai::add_special_tokens(false)); + } else { + encoded_input = m_tokenizer.encode(*input_vector, ov::genai::add_special_tokens(true)); + } } else if (auto input_prompt = std::get_if(&inputs)) { std::string& prompt = *input_prompt; @@ -104,7 +115,7 @@ DecodedResults StatefulLLMPipeline::generate( m_history.push_back({{"role", "user"}, {"content", prompt}}); constexpr bool add_generation_prompt = true; - auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); + auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); // Do not add special tokens in chat scenario to be aligned with HF. 
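The prompt-preparation branching added above recurs across the continuous-batching, stateful and static pipelines. The compact Python sketch below restates the same logic using the `openvino_genai` Tokenizer bindings; it is illustrative only, the helper name is hypothetical, and it assumes the `chat_template` property exposed by this patch together with the `add_special_tokens` encode option.

```python
# Illustrative sketch of the branching added in the pipelines above; not library code.
def encode_for_generation(tokenizer, prompt, apply_chat_template=True):
    if apply_chat_template and tokenizer.chat_template:
        templated = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}], add_generation_prompt=True)
        # the template already inserts the special tokens, so do not add them again
        return tokenizer.encode(templated, add_special_tokens=False)
    # no chat template found in tokenizer_config.json (or templating disabled)
    return tokenizer.encode(prompt, add_special_tokens=True)
```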
auto new_chat_tokens = m_tokenizer.encode(new_templated_chat_history, ov::genai::add_special_tokens(false)); auto prev_chat_tokens = m_tokenizer.encode(m_templated_chat_history, ov::genai::add_special_tokens(false)); @@ -157,7 +168,16 @@ DecodedResults StatefulLLMPipeline::generate( // TODO: Forbid LoRA config change if we are in the chat mode, because it requires regenerating the history with LoRA applied } else { - encoded_input = m_tokenizer.encode(prompt); + std::string& prompt = *input_prompt; + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + encoded_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + encoded_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } } diff --git a/src/cpp/src/llm_pipeline_static.cpp b/src/cpp/src/llm_pipeline_static.cpp index b17ee959c5..47426d1cf4 100644 --- a/src/cpp/src/llm_pipeline_static.cpp +++ b/src/cpp/src/llm_pipeline_static.cpp @@ -827,7 +827,15 @@ DecodedResults StatefulLLMPipeline::generate( // for chat ov::genai::add_special_tokens(false) is aligned with stateful pipeline and HF tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(false)); } else { - tokenized_input = m_tokenizer.encode(prompt); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + tokenized_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } auto encode_stop_time = std::chrono::steady_clock::now(); @@ -1294,7 +1302,15 @@ DecodedResults StatelessLLMPipeline::generate( // for chat ov::genai::add_special_tokens(false) is aligned with stateful pipeline and HF tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(false)); } else { - tokenized_input = m_tokenizer.encode(prompt); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + tokenized_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } auto encode_stop_time = std::chrono::steady_clock::now(); diff --git a/src/cpp/src/tokenizer.cpp b/src/cpp/src/tokenizer.cpp index 9676cdb5f3..2eadda53ba 100644 --- a/src/cpp/src/tokenizer.cpp +++ b/src/cpp/src/tokenizer.cpp @@ -573,6 +573,10 @@ class Tokenizer::TokenizerImpl { void set_chat_template(const std::string& chat_template) { m_chat_template = patch_chat_template(chat_template); } + + std::string get_chat_template() { + return m_chat_template; + } }; Tokenizer::Tokenizer(const 
std::filesystem::path& tokenizer_path, const ov::AnyMap& properties) { @@ -676,6 +680,10 @@ std::string Tokenizer::apply_chat_template(ChatHistory history, return m_pimpl->apply_chat_template(history, add_generation_prompt, chat_template); } +std::string Tokenizer::get_chat_template() const { + return m_pimpl->get_chat_template(); +} + void Tokenizer::set_chat_template(const std::string& chat_template) { m_pimpl->set_chat_template(chat_template); } diff --git a/src/cpp/src/visual_language/inputs_embedder.cpp b/src/cpp/src/visual_language/inputs_embedder.cpp index 66b17e5804..e912570f20 100644 --- a/src/cpp/src/visual_language/inputs_embedder.cpp +++ b/src/cpp/src/visual_language/inputs_embedder.cpp @@ -43,6 +43,8 @@ class InputsEmbedder::IInputsEmbedder { // If we use beam search sampling with chat mode we need to remove last answer of the model from kv cache and add best answer to history // so, let's keep info about amount of tokens to trim from kv cache and amount of tokens to keep in history ov::genai::utils::HistoryRemoveManager m_kv_history_manager = {0, 0}; + // True if chat template should be applied for non-chat scenario + bool m_apply_chat_template = true; public: virtual ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector& images, ov::genai::VLMPerfMetrics& metrics) = 0; @@ -82,6 +84,10 @@ class InputsEmbedder::IInputsEmbedder { std::copy(encoded_result.begin(), encoded_result.end(), std::back_inserter(m_tokenized_history)); } + void set_apply_chat_template_status(bool apply_chat_template) { + m_apply_chat_template = apply_chat_template; + } + virtual void start_chat(const std::string& system_message) { m_is_chat_conversation = true; m_kv_history_manager.reset(); @@ -155,7 +161,7 @@ class InputsEmbedder::IInputsEmbedder { m_history.push_back({{"role", "user"}, {"content", prompt}}); constexpr bool add_generation_prompt = true; std::string new_templated_chat_history; - try { + try { new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); } catch (const std::exception& error) { // Use fallback chat template if it was not found in tokenizer_config.json @@ -169,8 +175,23 @@ class InputsEmbedder::IInputsEmbedder { m_templated_chat_history = std::move(new_templated_chat_history); return {new_chat_tokens, prev_chat_tokens}; } else { + ov::Tensor encoded_input_ids; auto start_tokenizer_time = std::chrono::steady_clock::now(); - ov::Tensor encoded_input_ids = m_tokenizer.encode(prompt).input_ids; + if (m_apply_chat_template) { + std::string templated_prompt; + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + + if (!m_tokenizer.get_chat_template().empty()) { + templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + } else { + // Use fallback chat template if it was not found in tokenizer_config.json + templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt, chat_template_fallback); + } + encoded_input_ids = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)).input_ids; + } else { + encoded_input_ids = m_tokenizer.encode(prompt).input_ids; + } auto end_tokenizer_time = std::chrono::steady_clock::now(); metrics.raw_metrics.tokenization_durations.emplace_back(PerfMetrics::get_microsec(end_tokenizer_time - start_tokenizer_time)); return {encoded_input_ids, ov::Tensor()}; @@ -2046,6 +2067,10 @@ void InputsEmbedder::update_chat_history(const std::string& decoded_results) { return 
m_impl->update_chat_history(decoded_results); } +void InputsEmbedder::set_apply_chat_template_status(bool apply_chat_template) { + return m_impl->set_apply_chat_template_status(apply_chat_template); +} + void InputsEmbedder::finish_chat() { return m_impl->finish_chat(); } diff --git a/src/cpp/src/visual_language/inputs_embedder.hpp b/src/cpp/src/visual_language/inputs_embedder.hpp index 4462c58185..5bd7cd3004 100644 --- a/src/cpp/src/visual_language/inputs_embedder.hpp +++ b/src/cpp/src/visual_language/inputs_embedder.hpp @@ -58,6 +58,9 @@ class InputsEmbedder { // adds currently generated text to chat history void update_chat_history(const std::string& decoded_results); + // set the apply_chat_template flag, which determines whether chat template should be applied for non-chat scenarios + void set_apply_chat_template_status(bool apply_chat_template); + // finishes chat and clears a chat history void finish_chat(); private: diff --git a/src/cpp/src/visual_language/pipeline.cpp b/src/cpp/src/visual_language/pipeline.cpp index 95e3064548..a3f9859384 100644 --- a/src/cpp/src/visual_language/pipeline.cpp +++ b/src/cpp/src/visual_language/pipeline.cpp @@ -165,6 +165,8 @@ class ov::genai::VLMPipeline::VLMPipelineImpl { generation_config.set_eos_token_id(m_generation_config.eos_token_id); generation_config.validate(); + m_inputs_embedder->set_apply_chat_template_status(generation_config.apply_chat_template); + auto start_get_inputs_embeds = std::chrono::steady_clock::now(); ov::Tensor inputs_embeds = m_inputs_embedder->get_inputs_embeds(prompt, rgbs, perf_metrics); auto end_get_inputs_embeds = std::chrono::steady_clock::now(); diff --git a/src/cpp/src/whisper_generation_config.cpp b/src/cpp/src/whisper_generation_config.cpp index ec12170cf9..64bcd3e359 100644 --- a/src/cpp/src/whisper_generation_config.cpp +++ b/src/cpp/src/whisper_generation_config.cpp @@ -14,6 +14,10 @@ namespace ov { namespace genai { +WhisperGenerationConfig::WhisperGenerationConfig() { + apply_chat_template = false; +} + WhisperGenerationConfig::WhisperGenerationConfig(const std::filesystem::path& json_path) : GenerationConfig::GenerationConfig(json_path) { using ov::genai::utils::read_json_param; @@ -38,6 +42,8 @@ WhisperGenerationConfig::WhisperGenerationConfig(const std::filesystem::path& js } read_json_param(data, "lang_to_id", lang_to_id); + + apply_chat_template = false; } void WhisperGenerationConfig::update_generation_config(const ov::AnyMap& config_map) { diff --git a/src/python/openvino_genai/py_openvino_genai.pyi b/src/python/openvino_genai/py_openvino_genai.pyi index f1898d1232..62f3fb6060 100644 --- a/src/python/openvino_genai/py_openvino_genai.pyi +++ b/src/python/openvino_genai/py_openvino_genai.pyi @@ -550,6 +550,7 @@ class GenerationConfig: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -578,6 +579,7 @@ class GenerationConfig: num_return_sequences: the number of sequences to generate from a single prompt. 
""" adapters: AdapterConfig | None + apply_chat_template: bool assistant_confidence_threshold: float diversity_penalty: float do_sample: bool @@ -996,6 +998,7 @@ class LLMPipeline: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -1081,6 +1084,7 @@ class LLMPipeline: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -1653,6 +1657,7 @@ class Tokenizer: openvino_genai.Tokenizer object is used to initialize Tokenizer if it's located in a different path than the main model. """ + chat_template: str def __init__(self, tokenizer_path: os.PathLike, properties: dict[str, typing.Any] = {}, **kwargs) -> None: ... def apply_chat_template(self, history: list[dict[str, str]], add_generation_prompt: bool, chat_template: str = '') -> str: diff --git a/src/python/py_generation_config.cpp b/src/python/py_generation_config.cpp index e2a6d7062c..a2c77589db 100644 --- a/src/python/py_generation_config.cpp +++ b/src/python/py_generation_config.cpp @@ -47,6 +47,7 @@ char generation_config_docstring[] = R"( echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -115,6 +116,7 @@ void init_generation_config(py::module_& m) { .def_readwrite("include_stop_str_in_output", &GenerationConfig::include_stop_str_in_output) .def_readwrite("stop_token_ids", &GenerationConfig::stop_token_ids) .def_readwrite("adapters", &GenerationConfig::adapters) + .def_readwrite("apply_chat_template", &GenerationConfig::apply_chat_template) .def("set_eos_token_id", &GenerationConfig::set_eos_token_id, py::arg("tokenizer_eos_token_id")) .def("is_beam_search", &GenerationConfig::is_beam_search) .def("is_greedy_decoding", &GenerationConfig::is_greedy_decoding) diff --git a/src/python/py_tokenizer.cpp b/src/python/py_tokenizer.cpp index 0dd9f3d715..5d8640b9d5 100644 --- a/src/python/py_tokenizer.cpp +++ b/src/python/py_tokenizer.cpp @@ -109,6 +109,12 @@ void init_tokenizer(py::module_& m) { "Override a chat_template read from tokenizer_config.json." 
) + .def_property( + "chat_template", + &Tokenizer::get_chat_template, + &Tokenizer::set_chat_template + ) + .def("get_pad_token_id", &Tokenizer::get_pad_token_id) .def("get_bos_token_id", &Tokenizer::get_bos_token_id) .def("get_eos_token_id", &Tokenizer::get_eos_token_id) diff --git a/tests/python_tests/common.py b/tests/python_tests/common.py index b0b6a70e93..320f1e1a6a 100644 --- a/tests/python_tests/common.py +++ b/tests/python_tests/common.py @@ -252,7 +252,12 @@ def run_hugging_face( # process prompt by promp as we have multiple generation configs for prompt, generation_config in zip(prompts, generation_configs): hf_generation_config = convert_to_hf(opt_model.generation_config, generation_config) - inputs = hf_tokenizer(prompt, return_tensors="pt") + inputs = {} + if hf_tokenizer.chat_template and generation_config.apply_chat_template: + prompt = hf_tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + inputs = hf_tokenizer(prompt, return_tensors="pt", add_special_tokens=False) + else: + inputs = hf_tokenizer(prompt, return_tensors="pt") input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask'] prompt_len = 0 if generation_config.echo else input_ids.numel() @@ -266,8 +271,15 @@ def run_hugging_face( generation_result.m_scores = [score for score in generate_outputs.sequences_scores] generation_results.append(generation_result) else: - # process all prompts as a single batch as we have a single generation config for all prompts - inputs = hf_tokenizer(prompts, return_tensors='pt', padding=True, truncation=True, add_special_tokens=True, padding_side='left') + inputs = {} + if hf_tokenizer.chat_template and generation_configs.apply_chat_template: + processed_prompts = [] + for prompt in prompts: + processed_prompts.append(hf_tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True)) + # process all prompts as a single batch as we have a single generation config for all prompts + inputs = hf_tokenizer(processed_prompts, return_tensors='pt', padding=True, truncation=True, add_special_tokens=False, padding_side='left') + else: + inputs = hf_tokenizer(prompts, return_tensors='pt', padding=True, truncation=True, padding_side='left') input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask'] hf_generation_config = convert_to_hf(opt_model.generation_config, generation_configs) hf_encoded_outputs = opt_model.generate(input_ids, attention_mask=attention_mask, generation_config=hf_generation_config, tokenizer=hf_tokenizer) diff --git a/tests/python_tests/test_generation_config.py b/tests/python_tests/test_generation_config.py index 72da672713..c204ac7ecf 100644 --- a/tests/python_tests/test_generation_config.py +++ b/tests/python_tests/test_generation_config.py @@ -58,6 +58,8 @@ def verify_set_values(generation_config, kwargs): dict(max_new_tokens=1, assistant_confidence_threshold=0.5), dict(max_new_tokens=1, num_assistant_tokens=2), dict(max_new_tokens=1, num_assistant_tokens=2, max_ngram_size=2), # prompt lookup + dict(max_new_tokens=1, apply_chat_template=True), + dict(max_new_tokens=1, apply_chat_template=False), ] @pytest.mark.parametrize("generation_config_kwargs", configs) @pytest.mark.precommit diff --git a/tests/python_tests/test_llm_pipeline.py b/tests/python_tests/test_llm_pipeline.py index 8968f2a083..276aff7251 100644 --- a/tests/python_tests/test_llm_pipeline.py +++ b/tests/python_tests/test_llm_pipeline.py @@ -26,7 +26,7 @@ test_cases = 
[ (dict(max_new_tokens=20), '你好! 你好嗎?'), - (dict(max_new_tokens=30, num_beams=15, num_beam_groups=3, num_return_sequences=15, diversity_penalty=1.0), 'Alan Turing was a'), + (dict(max_new_tokens=30, num_beams=15, num_beam_groups=3, num_return_sequences=15, diversity_penalty=1.0), 'Why is the Sun yellow?'), ] @pytest.mark.parametrize("generation_config_dict,prompt", test_cases) @pytest.mark.parametrize("model_descr", get_models_list()) @@ -339,7 +339,7 @@ def test_unicode_pybind_decoding_one_string(): # Test that pybind will not fail. model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] - res_str = ov_pipe.generate(',', max_new_tokens=4) + res_str = ov_pipe.generate(',', max_new_tokens=4, apply_chat_template=False) assert '�' == res_str[-1] @@ -350,7 +350,7 @@ def test_unicode_pybind_decoding_batched(): # Test that pybind will not fail. model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] - res_str = ov_pipe.generate([","], max_new_tokens=4) + res_str = ov_pipe.generate([","], max_new_tokens=4, apply_chat_template=False) assert '�' == res_str.texts[0][-1] @@ -362,7 +362,7 @@ def test_unicode_pybind_decoding_one_string_streamer(): model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] res_str = [] - ov_pipe.generate(",", max_new_tokens=4, streamer=lambda x: res_str.append(x)) + ov_pipe.generate(",", max_new_tokens=4, apply_chat_template=False, streamer=lambda x: res_str.append(x)) assert '�' == ''.join(res_str)[-1] # diff --git a/tests/python_tests/test_sampling.py b/tests/python_tests/test_sampling.py index 7a3aced29a..28b2afd42a 100644 --- a/tests/python_tests/test_sampling.py +++ b/tests/python_tests/test_sampling.py @@ -18,7 +18,7 @@ (dict(max_new_tokens=30, min_new_tokens=30), '你好! 
你好嗎?'), (dict(max_new_tokens=30, ignore_eos=True), 'Alan Turing was a'), # (dict(max_length=40), 'table is made of'), - (dict(stop_token_ids={28998}), 'The Sun is yellow because'), # since a test does not hang, it means stop token is met + (dict(stop_token_ids={28998}, apply_chat_template=False), 'The Sun is yellow because'), # since a test does not hang, it means stop token is met, skip chat template to generate long answer # (dict(max_new_tokens=1, min_new_tokens=0, echo=True), 'What is OpenVINO?') ], ids=["max_new_tokens", @@ -59,7 +59,7 @@ def test_stop_strings(tmp_path, generation_config): @pytest.mark.parametrize("generation_config", [dict(max_new_tokens=30), dict(max_new_tokens=30, repetition_penalty=2.0), - dict(max_new_tokens=300)], + dict(max_new_tokens=300, apply_chat_template=False)], ids=["basic", "repetition_penalty", "long_max_new_tokens"]) @pytest.mark.parametrize("prompt", [ 'What is OpenVINO?', diff --git a/tools/llm_bench/task/text_generation.py b/tools/llm_bench/task/text_generation.py index 76f5678dd9..7b123cc7b3 100644 --- a/tools/llm_bench/task/text_generation.py +++ b/tools/llm_bench/task/text_generation.py @@ -234,6 +234,7 @@ def run_text_generation_genai(input_text, num, model, tokenizer, args, iter_data gen_config.rng_seed = args["seed"] gen_config.num_beams = args["num_beams"] gen_config.do_sample = False + gen_config.apply_chat_template = False if args.get('draft_model', ''): config_info = "Speculative decoding config: " if args.get('num_assistant_tokens', None): @@ -381,6 +382,7 @@ def run_text_generation_genai_with_stream(input_text, num, model, tokenizer, arg gen_config.num_beams = args["num_beams"] gen_config.do_sample = False gen_config.ignore_eos = True + gen_config.apply_chat_template = False enable_prompt_permutations = not args.get("disable_prompt_permutation", False) if enable_prompt_permutations: log.warning( diff --git a/tools/llm_bench/task/visual_language_generation.py b/tools/llm_bench/task/visual_language_generation.py index a02b16b2bb..9cc6702999 100644 --- a/tools/llm_bench/task/visual_language_generation.py +++ b/tools/llm_bench/task/visual_language_generation.py @@ -211,6 +211,7 @@ def run_visual_language_generation_genai( gen_config.max_new_tokens = max_gen_tokens gen_config.num_beams = args["num_beams"] gen_config.do_sample = False + gen_config.apply_chat_template = False kwargs = {} if len(images) >= 1: kwargs["images"] = images[0] diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 1eb778a060..408442a3d9 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -267,7 +267,7 @@ def genai_gen_text(model, tokenizer, question, max_new_tokens, skip_question, us model.finish_chat() return result else: - return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens) + return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=False) def llamacpp_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False): @@ -336,6 +336,7 @@ def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_to config = model.get_generation_config() config.max_new_tokens = max_new_tokens config.do_sample = False + config.apply_chat_template = False model.set_generation_config(config) model.start_chat() From 6c3ecf9c65565505798be444c98dc1514552e6bc Mon Sep 17 00:00:00 2001 From: Vladimir Zlobin Date: Wed, 29 Jan 2025 19:35:52 +0400 Subject: [PATCH 
07/15] beam_search_causal_lm.cpp: delete wrong comment (#1639) --- samples/cpp/text_generation/beam_search_causal_lm.cpp | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/samples/cpp/text_generation/beam_search_causal_lm.cpp b/samples/cpp/text_generation/beam_search_causal_lm.cpp index 9e1ee069ad..2f50100ac6 100644 --- a/samples/cpp/text_generation/beam_search_causal_lm.cpp +++ b/samples/cpp/text_generation/beam_search_causal_lm.cpp @@ -19,9 +19,7 @@ int main(int argc, char* argv[]) try { config.num_beams = 15; config.diversity_penalty = 1.0f; config.num_return_sequences = config.num_beams; - - // Since the streamer is set, the results will - // be printed each time a new token is generated. + auto beams = pipe.generate(prompts, config); std::cout << beams << '\n'; } catch (const std::exception& error) { From ec50b5b68baf2736da7f36e2543ec68667b4e064 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Wed, 29 Jan 2025 20:19:21 +0400 Subject: [PATCH 08/15] [WWB]: Fixed nano-Llava preprocessor selection (#1646) Partially fixes WWB flow for nano-Llava. It works for Optimum inference but requires additional changes on the Optimum side to support HF Transformers. --- tools/who_what_benchmark/whowhatbench/wwb.py | 24 +++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 408442a3d9..ce88ef1fab 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -4,7 +4,7 @@ import logging import os -from transformers import AutoTokenizer, AutoProcessor +from transformers import AutoTokenizer, AutoProcessor, AutoConfig import openvino as ov import pandas as pd @@ -220,17 +220,19 @@ def load_tokenizer(args): def load_processor(args): - processor = None - if args.base_model is not None: - processor = AutoProcessor.from_pretrained( - args.base_model, trust_remote_code=True - ) - elif args.target_model is not None: - processor = AutoProcessor.from_pretrained( - args.target_model, trust_remote_code=True - ) + model_id = args.base_model if args.base_model is not None else args.target_model + if model_id is None: + return None + + config = AutoConfig.from_pretrained(model_id, trust_remote_code=True) + if "llava-qwen" in config.model_type: + preprocessor_id = config.mm_vision_tower + else: + preprocessor_id = model_id - return processor + return AutoProcessor.from_pretrained( + preprocessor_id, trust_remote_code=True + ) def diff_strings(a: str, b: str, *, use_loguru_colors: bool = False) -> str: From 624eb00c24383c6db5156560b4d9bc80faf1fce1 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Thu, 30 Jan 2025 10:37:54 +0400 Subject: [PATCH 09/15] [WWB]: Added config to preprocessor call in VLMs (#1638) --- tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py b/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py index f0989e9041..6e259bf409 100644 --- a/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py +++ b/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py @@ -118,7 +118,7 @@ def default_gen_answer( preprocess_inputs = MODEL_TYPE_TO_CLS_MAPPING[ model.config.model_type ].preprocess_inputs - inputs = preprocess_inputs(prompt, image, processor, tokenizer) + inputs = preprocess_inputs(prompt, image, processor, tokenizer, config=model.config) tokens = 
model.generate( **inputs, do_sample=False, From 0c7ce5e0603cc9a98c785f605d369535695c39e5 Mon Sep 17 00:00:00 2001 From: Ilya Lavrenov Date: Thu, 30 Jan 2025 11:18:13 +0400 Subject: [PATCH 10/15] CB: remove DeviceConfig class (#1640) --- .github/labeler.yml | 1 - src/cpp/src/cache_manager.hpp | 81 ++++++++++++++--- src/cpp/src/continuous_batching_impl.cpp | 46 +++++----- src/cpp/src/continuous_batching_impl.hpp | 4 +- src/cpp/src/device_config.hpp | 91 ------------------- .../models/autoencoder_kl.cpp | 2 - .../src/paged_attention_transformations.cpp | 33 ++----- .../src/paged_attention_transformations.hpp | 22 +++-- src/cpp/src/scheduler.hpp | 13 ++- ...batching_for_speculative_decoding_impl.cpp | 4 +- ...batching_for_speculative_decoding_impl.hpp | 2 +- .../speculative_decoding_impl.cpp | 27 +++--- tests/cpp/cache_manager.cpp | 26 ++---- tests/cpp/scheduler.cpp | 4 +- 14 files changed, 145 insertions(+), 211 deletions(-) delete mode 100644 src/cpp/src/device_config.hpp diff --git a/.github/labeler.yml b/.github/labeler.yml index a75abd795c..dbc319c29a 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -99,7 +99,6 @@ - 'src/cpp/src/continuous_batching_impl.cpp' - 'src/cpp/src/continuous_batching_pipeline.cpp' - 'src/cpp/src/debug_utils.hpp' -- 'src/cpp/src/device_config.hpp' - 'src/cpp/src/generation_handle.cpp' - 'src/cpp/src/generation_stream.hpp' - 'src/cpp/src/model_runner.hpp' diff --git a/src/cpp/src/cache_manager.hpp b/src/cpp/src/cache_manager.hpp index 255bb926be..cc89497625 100644 --- a/src/cpp/src/cache_manager.hpp +++ b/src/cpp/src/cache_manager.hpp @@ -5,12 +5,12 @@ #include #include + #include "openvino/runtime/tensor.hpp" -#include "device_config.hpp" +#include "paged_attention_transformations.hpp" #ifndef _WIN32 #include -#include "openvino/core/shape.hpp" class TensorMmapAllocator { @@ -49,11 +49,13 @@ namespace ov::genai { class CacheManager { size_t m_num_decoder_layers = 0; std::string m_device; + size_t m_block_size = 0; // block size is per inference device std::vector m_key_precisions, m_value_precisions; std::vector m_key_shapes, m_value_shapes; std::vector m_key_cache, m_value_cache; size_t m_num_allocated_kv_blocks = 0, m_block_size_in_bytes = 0; ov::InferRequest m_request; + size_t m_k_head_size = 0; static ov::Shape set_kv_blocks(ov::PartialShape pshape, size_t num_kv_blocks) { pshape[0] = num_kv_blocks; @@ -65,47 +67,88 @@ class CacheManager { m_request.set_tensor(std::string("value_cache.") + std::to_string(decoder_layer_id), m_value_cache[decoder_layer_id]); } - ov::PartialShape patch_shape(ov::PartialShape pshape, ov::element::Type cache_type) { + ov::PartialShape to_partial_shape(const KVHeadConfig& config, ov::element::Type cache_type, bool key_param) { OPENVINO_ASSERT(!m_device.empty(), "Internal error: device is not set"); + OPENVINO_ASSERT(m_block_size > 0, "Internal error: block size is not set yet"); + + ov::PartialShape pshape; + + if (m_device.find("CPU") != std::string::npos) { + if (key_param) { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_k_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.k_head_size)}; - if (m_device.find("CPU") != std::string::npos && cache_type == ov::element::u8) { - // Scale, zero point and quantized data will be stored together. 
- // The layout for per token per head: - // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| - // so, we have to extend head_size by 8, which is sizeof(float) - // for scale and sizeof(float) for zeropoint - pshape[3] += 2 * sizeof(float); + if (m_k_head_size == 0) { + m_k_head_size = config.k_head_size; + } + } else { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_v_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.v_head_size)}; + } + + if (cache_type == ov::element::u8) { + // Scale, zero point and quantized data will be stored together. + // The layout for per token per head: + // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| + // so, we have to extend head_size by 8, which is sizeof(float) + // for scale and sizeof(float) for zeropoint + pshape[3] += 2 * sizeof(float); + } + } else if (m_device.find("GPU") != std::string::npos) { + if (key_param) { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_k_heads), + ov::Dimension(config.k_head_size), + ov::Dimension(m_block_size)}; + } else { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_v_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.v_head_size)}; + } + } else { + OPENVINO_THROW("Internal error: unsupported device ", m_device); } return pshape; } public: - CacheManager(ov::InferRequest request, const DeviceConfig& device_config) : + CacheManager(ov::InferRequest request, const std::vector& kv_cache_config) : m_request(request) { // extract information about inference device ov::CompiledModel compiled_model = request.get_compiled_model(); std::vector execution_devices = compiled_model.get_property(ov::execution_devices); OPENVINO_ASSERT(execution_devices.size() == 1, "Contituous batching: execution device is expected to be CPU or GPU, but got ", execution_devices.size(), " devices"); m_device = execution_devices[0]; + + // set block_size depending on device + const size_t cpu_block_size = 32, gpu_block_size = 16; + const bool is_gpu = m_device.find("GPU") != std::string::npos; + m_block_size = is_gpu ? 
gpu_block_size : cpu_block_size; // extract information about KV cache precisions and shapes size_t kv_input_index = 0; for (const auto& input : compiled_model.inputs()) { for (auto & name : input.get_names()) { auto cache_precision = input.get_element_type(); + ov::PartialShape pshape; if (name.find("key_cache.") == 0) { - auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision); + pshape = to_partial_shape(kv_cache_config[kv_input_index], cache_precision, true); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); m_key_shapes.push_back(pshape); m_key_precisions.push_back(cache_precision); - m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); break; } else if (name.find("value_cache.") == 0) { - auto pshape = patch_shape(device_config.get_value_cache_shape(kv_input_index), cache_precision); + pshape = to_partial_shape(kv_cache_config[kv_input_index], cache_precision, false); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); m_value_shapes.push_back(pshape); m_value_precisions.push_back(cache_precision); - m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); ++kv_input_index; break; } @@ -124,6 +167,10 @@ class CacheManager { return m_device; } + size_t get_block_size() const { + return m_block_size; + } + ov::element::Type get_key_cache_precision(size_t decoder_layer_id) const { OPENVINO_ASSERT(decoder_layer_id < m_key_precisions.size()); return m_key_precisions[decoder_layer_id]; @@ -251,6 +298,10 @@ class CacheManager { return m_value_cache[decoder_layer_id]; } + size_t get_v_head_size(size_t layer_id) const { + return m_value_shapes[layer_id][3].get_length(); + } + void copy_blocks(const std::map>& block_copy_map) { for (const auto & blocks_pair : block_copy_map) { size_t src_block_id = blocks_pair.first; diff --git a/src/cpp/src/continuous_batching_impl.cpp b/src/cpp/src/continuous_batching_impl.cpp index b4100f8aec..f95cd3b9c6 100644 --- a/src/cpp/src/continuous_batching_impl.cpp +++ b/src/cpp/src/continuous_batching_impl.cpp @@ -113,14 +113,12 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::ContinuousBatchingImpl( m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - DeviceConfig device_config(device); - bool is_need_per_layer_cache_control = scheduler_config.use_cache_eviction; bool allow_cache_rotation = scheduler_config.cache_eviction_config.apply_rotation; - utils::apply_paged_attention_transformations(model, device_config, is_need_per_layer_cache_control, allow_cache_rotation); + auto kv_cache_config = utils::apply_paged_attention_transformations(model, is_need_per_layer_cache_control, allow_cache_rotation); utils::apply_gather_before_matmul_transformation(model); - initialize_pipeline(model, scheduler_config, properties, device_config); + initialize_pipeline(model, scheduler_config, device, properties, kv_cache_config); } ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() { @@ -139,29 +137,31 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_pull_awaiting_requests void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( std::shared_ptr model, const SchedulerConfig& scheduler_config, + const std::string& device, const ov::AnyMap& properties, 
- const DeviceConfig& device_config) { + const std::vector& kv_cache_config) { ov::Core core = utils::singleton_core(); ov::CompiledModel compiled_model; // TODO: remove once plugin automatically set KV cache precisions - apply_kv_cache_precision(model, device_config.get_device(), properties); + apply_kv_cache_precision(model, device, properties); // apply LoRA if (auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters)) { m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model."); - m_adapter_controller = AdapterController(model, *m_generation_config.adapters, device_config.get_device()); // TODO: Make the prefix name configurable - compiled_model = core.compile_model(model, device_config.get_device(), *filtered_properties); + m_adapter_controller = AdapterController(model, *m_generation_config.adapters, device); // TODO: Make the prefix name configurable + compiled_model = core.compile_model(model, device, *filtered_properties); } else { - compiled_model = core.compile_model(model, device_config.get_device(), properties); + compiled_model = core.compile_model(model, device, properties); } ov::genai::utils::print_compiled_model_properties(compiled_model, "LLM with Paged Attention"); ov::InferRequest infer_request = compiled_model.create_infer_request(); // Cache manager - std::shared_ptr cache_manager = std::make_shared(infer_request, device_config); + std::shared_ptr cache_manager = std::make_shared(infer_request, kv_cache_config); m_num_decoder_layers = cache_manager->get_num_decoder_layers(); + m_block_size = cache_manager->get_block_size(); // Scheduler SchedulerConfig normalized_config = scheduler_config; @@ -171,13 +171,13 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( } bool can_use_partial_preemption = true; - if (device_config.get_device().find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { + if (device.find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { // in case of executing a `vLLM-like` pipeline, it's better not to use partial eviction on the GPU, // as it may lead to performance slowdown can_use_partial_preemption = false; } - m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, normalized_config, m_num_decoder_layers, can_use_partial_preemption); + m_scheduler = std::make_shared(m_block_size, cache_manager, normalized_config, m_num_decoder_layers, can_use_partial_preemption); // Model Runner bool is_use_cache_eviction = m_scheduler->get_config().use_cache_eviction; @@ -185,7 +185,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( const auto& eviction_config = m_scheduler->get_config().cache_eviction_config; bool is_apply_rotation = eviction_config.apply_rotation; m_model_runner = std::make_shared(infer_request, - m_scheduler->get_block_size(), + m_block_size, m_num_decoder_layers, /* collect_attention_scores = */ true, /* is_use_per_layer_cache_control = */ true, @@ -199,10 +199,10 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( m_rotation_deltas_stores.push_back(store); } - size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; - size_t embedding_size = device_config.get_k_head_size(0); + size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_block_size + 1; + size_t embedding_size = 
cache_manager->get_v_head_size(0); m_cache_rotation_calculator = std::make_shared( - m_scheduler->get_block_size(), + m_block_size, max_sequence_cache_occupation_length_in_blocks, embedding_size); auto rotation_trig_lut = ov::Tensor(ov::element::f32, ov::Shape{max_sequence_cache_occupation_length_in_blocks, embedding_size}); @@ -224,7 +224,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( } } else { m_model_runner = - std::make_shared(infer_request, m_scheduler->get_block_size(), m_num_decoder_layers); + std::make_shared(infer_request, m_block_size, m_num_decoder_layers); } m_sampler = std::make_shared(m_tokenizer); @@ -245,9 +245,7 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::add_request(uint64_t request sampling_params.set_eos_token_id(m_generation_config.eos_token_id); sampling_params.validate(); - SequenceGroup::Ptr sequence_group = std::make_shared(request_id, input_ids, - sampling_params, - m_scheduler->get_block_size()); + SequenceGroup::Ptr sequence_group = std::make_shared(request_id, input_ids, sampling_params, m_block_size); if (m_scheduler->get_config().enable_prefix_caching) { m_scheduler->restore_cached_blocks(sequence_group); @@ -662,8 +660,8 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_compute_cache_rotation size_t block_offset = num_blocks_to_rotate_for_each_layer[layer_idx]; auto rotation_deltas_tensor_data = m_rotation_deltas_stores[layer_idx].data() + block_offset; - for (size_t tok_idx = 0; tok_idx < m_scheduler->get_block_size(); tok_idx++) { - rotation_deltas_tensor_data[tok_idx] = block_rotation_data.rotation_delta / m_scheduler->get_block_size(); + for (size_t tok_idx = 0; tok_idx < m_block_size; tok_idx++) { + rotation_deltas_tensor_data[tok_idx] = block_rotation_data.rotation_delta / m_block_size; } num_blocks_to_rotate_for_each_layer[layer_idx] += 1; } @@ -693,7 +691,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_maybe_evict_cache_bloc auto seq_id = seq_id_and_attention_scores.first; const auto& attention_scores_for_all_decoder_layers = seq_id_and_attention_scores.second; if (m_seq_group_id_to_cache_eviction_algo_map.find(seq_id) == m_seq_group_id_to_cache_eviction_algo_map.end()) { - m_seq_group_id_to_cache_eviction_algo_map[seq_id] = CacheEvictionAlgorithm(sched_config.cache_eviction_config, m_scheduler->get_block_size(), num_decoder_layers); + m_seq_group_id_to_cache_eviction_algo_map[seq_id] = CacheEvictionAlgorithm(sched_config.cache_eviction_config, m_block_size, num_decoder_layers); } auto& cache_eviction_algo = m_seq_group_id_to_cache_eviction_algo_map[seq_id]; cache_eviction_algo.register_new_token_scores(attention_scores_for_all_decoder_layers); @@ -728,7 +726,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_maybe_evict_cache_bloc // Assuming that the evicted blocks are always full (since they by design are only selected from intermediate-age blocks) auto seq_group_ptr = seq_group_ptr_and_num_blocks_evicted.first; auto num_blocks_evicted = seq_group_ptr_and_num_blocks_evicted.second; - seq_group_ptr->register_token_eviction(num_blocks_evicted * m_scheduler->get_block_size()); + seq_group_ptr->register_token_eviction(num_blocks_evicted * m_block_size); } } diff --git a/src/cpp/src/continuous_batching_impl.hpp b/src/cpp/src/continuous_batching_impl.hpp index 9fa6c9c660..1ee40ef73c 100644 --- a/src/cpp/src/continuous_batching_impl.hpp +++ b/src/cpp/src/continuous_batching_impl.hpp @@ -37,6 +37,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public 
ContinuousBatc bool m_is_validation_mode_enabled = false; size_t m_num_decoder_layers = 0; + size_t m_block_size = 0; // Pre-allocated per-layer storages for the per-token cache re-rotation deltas used in cache eviction case std::vector m_rotation_deltas_stores; @@ -58,8 +59,9 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc void initialize_pipeline(std::shared_ptr model, const SchedulerConfig& scheduler_config, + const std::string& device, const ov::AnyMap& plugin_config, - const DeviceConfig& device_config); + const std::vector& kv_cache_config); /** * Pulls requests from awaiting queue to running queue diff --git a/src/cpp/src/device_config.hpp b/src/cpp/src/device_config.hpp deleted file mode 100644 index 09020da9a8..0000000000 --- a/src/cpp/src/device_config.hpp +++ /dev/null @@ -1,91 +0,0 @@ -// Copyright (C) 2023-2025 Intel Corporation -// SPDX-License-Identifier: Apache-2.0 - -#pragma once - -#include "openvino/runtime/core.hpp" -#include "openvino/core/shape.hpp" -#include "openvino/core/type/element_type.hpp" - -#include "openvino/genai/scheduler_config.hpp" - -namespace ov::genai { - -/** - * Per layer KV cache size configuration - */ -struct KVHeadConfig { - size_t num_v_heads, num_k_heads; - size_t v_head_size, k_head_size; -}; - -class DeviceConfig { - std::vector m_key_cache_shape, m_value_cache_shape; - std::vector m_kv_heads_config; - size_t m_block_size = 0; // block size is per inference device - std::string m_device; - - size_t get_block_size_by_device(const std::string& device) const { - const size_t cpu_block_size = 32, gpu_block_size = 16; - const bool is_gpu = device.find("GPU") != std::string::npos; - return is_gpu ? gpu_block_size : cpu_block_size; - } - -public: - explicit DeviceConfig(const std::string& device) { - m_device = device; - m_block_size = get_block_size_by_device(device); - } - - void set_kv_head_configs(const std::vector& kv_heads_config) { - m_kv_heads_config = kv_heads_config; - m_key_cache_shape.reserve(m_kv_heads_config.size()); - m_value_cache_shape.reserve(m_kv_heads_config.size()); - - for (size_t layer_id = 0; layer_id < kv_heads_config.size(); layer_id++) { - const KVHeadConfig& config = m_kv_heads_config[layer_id]; - - m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_v_heads), - ov::Dimension(m_block_size), - ov::Dimension(config.v_head_size)}); - - if (m_device.find("CPU") != std::string::npos) { - m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_k_heads), - ov::Dimension(m_block_size), - ov::Dimension(config.k_head_size)}); - } else if (m_device.find("GPU") != std::string::npos) { - // Update key shape, as the key's shape is different from the value's shape - m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_k_heads), - ov::Dimension(config.k_head_size), - ov::Dimension(m_block_size)}); - } - } - } - - std::string get_device() const { - return m_device; - } - - ov::PartialShape get_key_cache_shape(size_t id) const { - OPENVINO_ASSERT(m_key_cache_shape.size()); - return m_key_cache_shape[id]; - } - - ov::PartialShape get_value_cache_shape(size_t id) const { - OPENVINO_ASSERT(m_value_cache_shape.size()); - return m_value_cache_shape[id]; - } - - size_t get_k_head_size(size_t layer_id) const { - return m_kv_heads_config[layer_id].k_head_size; - } - - size_t get_block_size() const { - return m_block_size; - } -}; - -} diff --git 
a/src/cpp/src/image_generation/models/autoencoder_kl.cpp b/src/cpp/src/image_generation/models/autoencoder_kl.cpp index bcec125375..e7357c3f36 100644 --- a/src/cpp/src/image_generation/models/autoencoder_kl.cpp +++ b/src/cpp/src/image_generation/models/autoencoder_kl.cpp @@ -68,8 +68,6 @@ class DiagonalGaussianDistribution { // for BW compatibility with 2024.6.0 ov::AnyMap handle_scale_factor(std::shared_ptr model, const std::string& device, ov::AnyMap properties) { - std::cout << ov::Any(properties).as() << std::endl; - auto it = properties.find("WA_INFERENCE_PRECISION_HINT"); ov::element::Type wa_inference_precision = it != properties.end() ? it->second.as() : ov::element::undefined; if (it != properties.end()) { diff --git a/src/cpp/src/paged_attention_transformations.cpp b/src/cpp/src/paged_attention_transformations.cpp index 6d337136dc..1c7ffd51d2 100644 --- a/src/cpp/src/paged_attention_transformations.cpp +++ b/src/cpp/src/paged_attention_transformations.cpp @@ -10,27 +10,14 @@ namespace ov { namespace genai { namespace utils { -size_t get_hidden_size(const std::shared_ptr model) { - const auto& parameters = model->get_parameters(); - // extract num_kv_heads and head_size - size_t kv_caches_inputs_offset = 2; - ov::PartialShape k_shape = parameters[kv_caches_inputs_offset]->get_partial_shape(); - OPENVINO_ASSERT(k_shape.rank().get_length() == 3, "KV cache shape is expected to have rank 3, while shape is ", k_shape); - size_t num_kv_heads = k_shape[1].get_length(), head_size = k_shape[2].get_length(); - return num_kv_heads * head_size; -} - -void apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control, bool allow_cache_rotation) { +std::vector apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control, bool allow_cache_rotation) { const ov::op::util::VariableVector& variables = model->get_variables(); OPENVINO_ASSERT(!variables.empty(), "Model is supposed to be stateful"); bool use_block_indices_inputs = per_layer_cache_control; bool use_score_outputs = per_layer_cache_control; - ov::pass::SDPAToPagedAttention(use_block_indices_inputs, use_score_outputs, allow_cache_rotation) - .run_on_model(model); -} + ov::pass::SDPAToPagedAttention(use_block_indices_inputs, use_score_outputs, allow_cache_rotation).run_on_model(model); -void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& device_config) { std::map> key_cache_params, value_cache_params; for (const auto& param_ptr : model->get_parameters()) { const auto& name = param_ptr->get_friendly_name(); @@ -44,10 +31,10 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& OPENVINO_ASSERT(key_cache_params.size() == value_cache_params.size() && key_cache_params.size() > 0); size_t num_decoder_layers = key_cache_params.size(); - std::vector kv_heads_config(num_decoder_layers); + std::vector kv_cache_config(num_decoder_layers); for (size_t idx = 0; idx < num_decoder_layers; idx++) { - KVHeadConfig& config = kv_heads_config[idx]; + KVHeadConfig& config = kv_cache_config[idx]; auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto key_shape = k->get_partial_shape(); @@ -60,10 +47,7 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& config.v_head_size = value_shape[2].get_length(); } - // save information about KV caches in device_config - // and create device dependent KV cache shapes - device_config.set_kv_head_configs(kv_heads_config); - + // reset information in KV cache parameters for (size_t idx = 
0; idx < num_decoder_layers; idx++) { auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; @@ -72,17 +56,14 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& k->set_element_type(ov::element::dynamic); v->set_element_type(ov::element::dynamic); - // set device specific KV cache shapes back to a PA model + // the exact order of dimensions within the shapes is not required by the plugin during compilation k->set_partial_shape(ov::PartialShape::dynamic(4)); v->set_partial_shape(ov::PartialShape::dynamic(4)); } model->validate_nodes_and_infer_types(); -} -void apply_paged_attention_transformations(std::shared_ptr model, DeviceConfig& device_config, bool per_layer_cache_control, bool allow_cache_rotation) { - apply_paged_attention_transformations(model, per_layer_cache_control, allow_cache_rotation); - set_kv_cache_type_and_shape(model, device_config); + return kv_cache_config; } } // namespace utils diff --git a/src/cpp/src/paged_attention_transformations.hpp b/src/cpp/src/paged_attention_transformations.hpp index aa86db2657..66cc6d6bc1 100644 --- a/src/cpp/src/paged_attention_transformations.hpp +++ b/src/cpp/src/paged_attention_transformations.hpp @@ -3,14 +3,23 @@ #pragma once +#include + #include "openvino/core/any.hpp" #include "openvino/core/model.hpp" -#include "device_config.hpp" namespace ov { namespace genai { -namespace utils { +/** + * Per layer KV cache size configuration + */ +struct KVHeadConfig { + size_t num_v_heads, num_k_heads; + size_t v_head_size, k_head_size; +}; + +namespace utils { /** Applies transformations to the ov::Model to enable paged attention inference. * @param model Pointer to the ov::Model representing one of the supported LLM architectures. * @param per_layer_cache_control If true, then the transformations will enable per-layer control of KV cache blocks, allowing to specify * different sets of KV cache blocks for different attention layers. If false, then the KV cache block structure will be identical across all * decoder layers. 
+ * @return Information about each decoder layer configuration */ -void apply_paged_attention_transformations(std::shared_ptr model, DeviceConfig& device_config, bool per_layer_cache_control = false, bool allow_cache_rotation = false); - -void apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control = false, bool allow_cache_rotation = false); - -size_t get_hidden_size(const std::shared_ptr model); - -void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& device_config); +std::vector apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control = false, bool allow_cache_rotation = false); void apply_gather_before_matmul_transformation(std::shared_ptr model); diff --git a/src/cpp/src/scheduler.hpp b/src/cpp/src/scheduler.hpp index 23db68deab..160734b520 100644 --- a/src/cpp/src/scheduler.hpp +++ b/src/cpp/src/scheduler.hpp @@ -9,7 +9,6 @@ #include "openvino/runtime/intel_gpu/properties.hpp" #include "openvino/genai/scheduler_config.hpp" -#include "device_config.hpp" #include "block_manager.hpp" #include "sequence_group.hpp" #include "cache_manager.hpp" @@ -45,10 +44,10 @@ class Scheduler { float m_cache_usage = 0.0; }; - explicit Scheduler(size_t block_size, std::shared_ptr cache_manager, const SchedulerConfig & config = {}, size_t num_layers = 1, bool can_use_partial_preemption = true) : - m_cache_manager(cache_manager), - m_can_use_partial_preemption(can_use_partial_preemption), - m_config(config) { + Scheduler(size_t block_size, std::shared_ptr cache_manager, const SchedulerConfig & config = {}, size_t num_layers = 1, bool can_use_partial_preemption = true) : + m_cache_manager(cache_manager), + m_can_use_partial_preemption(can_use_partial_preemption), + m_config(config) { m_block_manager = std::make_shared(m_config.num_kv_blocks, m_config.enable_prefix_caching, block_size, num_layers); OPENVINO_ASSERT(num_layers != 0, "num_layers must be non-zero"); } @@ -499,13 +498,13 @@ class Scheduler { auto seq_length = sequence_groups[idx]->get_prompt_len() * m_kv_blocks_initial_multiplier; auto gen_config = sequence_groups[idx]->get_sampling_parameters(); seq_length = std::min(seq_length, sequence_groups[idx]->get_prompt_len() + sequence_groups[idx]->get_max_new_tokens()); - size_t blocks_num = std::ceil((float)seq_length / m_block_manager->get_block_size()); + size_t blocks_num = std::ceil(static_cast(seq_length) / m_block_manager->get_block_size()); if (gen_config.is_beam_search()) { blocks_num *= gen_config.num_beams; } else if (gen_config.is_multinomial()) { blocks_num *= gen_config.num_return_sequences; } - blocks_sum += blocks_num; + blocks_sum += blocks_num; } m_block_manager->increase_kv_blocks_number(blocks_sum); m_dynamic_memory_allocation = true; diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp index 2ecdbd66f3..14dfaae60f 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp @@ -8,7 +8,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, - const DeviceConfig& device_config, + const std::vector& kv_cache_configs, const SchedulerConfig& scheduler_config, const std::string& device, const ov::AnyMap& 
plugin_config, @@ -16,7 +16,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin m_tokenizer = tokenizer; m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - initialize_pipeline(model, scheduler_config, plugin_config, device_config); + initialize_pipeline(model, scheduler_config, device, plugin_config, kv_cache_configs); } void diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp index b714316e75..68cc0e45c4 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp @@ -16,7 +16,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl : ContinuousBatchingForSpeculativeDecodingImpl(const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, - const DeviceConfig& device_config, + const std::vector& kv_cache_configs, const SchedulerConfig& scheduler_config, const std::string& device, const ov::AnyMap& plugin_config, diff --git a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp index 32d13feed1..51490945e7 100644 --- a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp @@ -33,8 +33,8 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con auto main_scheduler_config = main_model_desc.scheduler_config; auto main_device = main_model_desc.device; - utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); - utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); + auto main_kv_cache_config = utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); + auto draft_kv_cache_config = utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); utils::apply_gather_before_matmul_transformation(main_model); utils::apply_gather_before_matmul_transformation(draft_model); @@ -47,10 +47,18 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con if (is_draft_scheduler_undefined) { // split KV cache to 2 caches for main and draft models - size_t main_model_hidden_size = utils::get_hidden_size(main_model), - draft_model_hidden_size = utils::get_hidden_size(draft_model); - auto k = static_cast(draft_model_hidden_size) / (main_model_hidden_size + draft_model_hidden_size); + auto compute_total_hidden_size = [] (const std::vector& kv_cache_config) -> size_t { + size_t total_hidden_size = 0; + for (auto & config : kv_cache_config) { + total_hidden_size += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; + } + return total_hidden_size; + }; + float main_model_hidden_size = compute_total_hidden_size(main_kv_cache_config), + draft_model_hidden_size = compute_total_hidden_size(draft_kv_cache_config); + auto k = draft_model_hidden_size / (main_model_hidden_size + draft_model_hidden_size); + // TODO: work with KV blocks as it will be more precise instead of GBs size_t main_cache_size = std::ceil(main_scheduler_config.cache_size * (1.f - k)), 
draft_cache_size = main_scheduler_config.cache_size - main_cache_size; if (draft_cache_size == 0 && main_cache_size > 0) { @@ -64,11 +72,6 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con ov::AnyMap draft_properties = draft_model_desc.properties.empty() ? main_model_desc.properties : draft_model_desc.properties; - DeviceConfig main_device_config(main_device), draft_device_config(draft_device); - - utils::set_kv_cache_type_and_shape(main_model, main_device_config); - utils::set_kv_cache_type_and_shape(draft_model, draft_device_config); - // main and draft model can have different tokenizers // to do: support retokenization: 154103 Tokenizer main_model_tokenizer = main_model_desc.tokenizer; @@ -82,10 +85,10 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con // to create `main_pipeline` with enabled validation_mode and `draft_pipeline` with disabled validation mode m_main_pipeline = std::make_shared( main_model, main_model_tokenizer, main_model_desc.generation_config, - main_device_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); + main_kv_cache_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); m_draft_pipeline = std::make_shared( draft_model, draft_model_tokenizer, draft_model_desc.generation_config, - draft_device_config, draft_scheduler_config, draft_device, draft_properties, false); + draft_kv_cache_config, draft_scheduler_config, draft_device, draft_properties, false); } GenerationHandle diff --git a/tests/cpp/cache_manager.cpp b/tests/cpp/cache_manager.cpp index 864a7b43af..986b342ca7 100644 --- a/tests/cpp/cache_manager.cpp +++ b/tests/cpp/cache_manager.cpp @@ -5,7 +5,6 @@ #include #include "openvino/runtime/core.hpp" #include "scheduler.hpp" -#include "device_config.hpp" #include "cache_manager.hpp" #include "helper.hpp" @@ -35,17 +34,15 @@ TEST(TestCacheManager, test_cache_size_param) { scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; - const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); + const std::vector kv_cache_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(request, device_config); + auto cache_manager = std::make_shared(request, kv_cache_config); ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, cache_manager->get_block_size_in_bytes()); - auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + auto block_manager = BlockManager(num_kv_blocks, false, cache_manager->get_block_size(), cache_manager->get_num_decoder_layers()); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); const size_t kv_cache_total_size = scheduler_config.cache_size * 1024 * 1024 * 1024; @@ -63,13 +60,10 @@ TEST(TestCacheManager, test_kv_blocks_param) { scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; - const std::string device = "CPU"; - DeviceConfig device_config("CPU"); + const size_t cpu_block_size = 32; const size_t num_decoder_layers = 12; - const std::vector 
kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); - auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, device_config.get_block_size(), num_decoder_layers); + auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, cpu_block_size, num_decoder_layers); ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); } @@ -83,17 +77,15 @@ TEST(TestCacheManager, test_dynamic_cache_increase) { scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; - const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); + const std::vector kv_cache_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(request, device_config); + auto cache_manager = std::make_shared(request, kv_cache_config); size_t block_size_in_bytes = cache_manager->get_block_size_in_bytes(); const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, block_size_in_bytes); - auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + auto block_manager = BlockManager(num_kv_blocks, false, cache_manager->get_block_size(), cache_manager->get_num_decoder_layers()); ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); // check initial cache allocation @@ -115,4 +107,4 @@ TEST(TestCacheManager, test_dynamic_cache_increase) { // check that cache does not increase if new blocks were not allocated cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); -} \ No newline at end of file +} diff --git a/tests/cpp/scheduler.cpp b/tests/cpp/scheduler.cpp index b6aa5a9b53..1e147203f4 100644 --- a/tests/cpp/scheduler.cpp +++ b/tests/cpp/scheduler.cpp @@ -26,9 +26,7 @@ std::shared_ptr init_cache_manager(SchedulerConfig scheduler_confi ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); const size_t head_size = 64; std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size, head_size }); - ov::genai::DeviceConfig device_config("CPU"); - device_config.set_kv_head_configs(kv_head_configs); - return std::make_shared(request, device_config); + return std::make_shared(request, kv_head_configs); } TEST(TestScheduler, general_test) { From 97bb83ad20317cef9b937f345c67252fbd9808e8 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Thu, 30 Jan 2025 12:05:58 +0400 Subject: [PATCH 11/15] [WWB]: Added initialization of nano-llava in case of Transformers model (#1649) --- tools/who_what_benchmark/whowhatbench/model_loaders.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/who_what_benchmark/whowhatbench/model_loaders.py b/tools/who_what_benchmark/whowhatbench/model_loaders.py index c792a3c0b2..8ab73483b3 100644 --- a/tools/who_what_benchmark/whowhatbench/model_loaders.py +++ b/tools/who_what_benchmark/whowhatbench/model_loaders.py @@ -173,6 +173,10 @@ def load_visual_text_model( model_id, trust_remote_code=True, device_map=device.lower(), _attn_implementation="eager", use_flash_attention_2=False ) 
         model.eval()
+        try:
+            model.get_vision_tower().load_model()
+        except Exception:
+            pass
     elif use_genai:
         logger.info("Using OpenVINO GenAI API")
         model = load_visual_text_genai_pipeline(model_id, device, ov_config)

From debf2c60735e9dbae5c338b67f0dd5031ee751d8 Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 16:21:20 +0400
Subject: [PATCH 12/15] WWB: simplify code around start_chat / use_template (#1650)

See https://github.com/openvinotoolkit/openvino.genai/pull/1533

Co-authored-by: Alexander Kozlov
---
 tools/who_what_benchmark/whowhatbench/wwb.py | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py
index ce88ef1fab..2008f6aba4 100644
--- a/tools/who_what_benchmark/whowhatbench/wwb.py
+++ b/tools/who_what_benchmark/whowhatbench/wwb.py
@@ -263,13 +263,7 @@ def diff_strings(a: str, b: str, *, use_loguru_colors: bool = False) -> str:
 
 
 def genai_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False):
-    if use_chat_template:
-        model.start_chat()
-        result = model.generate(question, do_sample=False, max_new_tokens=max_new_tokens)
-        model.finish_chat()
-        return result
-    else:
-        return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=False)
+    return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=use_chat_template)
 
 
 def llamacpp_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False):
@@ -335,15 +329,7 @@ def genai_gen_inpainting(model, prompt, image, mask, num_inference_steps, genera
 
 def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_tokens, crop_question):
     image_data = ov.Tensor(np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8))
-    config = model.get_generation_config()
-    config.max_new_tokens = max_new_tokens
-    config.do_sample = False
-    config.apply_chat_template = False
-    model.set_generation_config(config)
-
-    model.start_chat()
-    out = model.generate(prompt, image=image_data)
-    model.finish_chat()
+    out = model.generate(prompt, image=image_data, do_sample=False, max_new_tokens=max_new_tokens)
     return out.texts[0]
 
 

From 0efa8a5b3bf7ffb06caa810c80ead149673ca5ab Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 16:42:15 +0400
Subject: [PATCH 13/15] Tokenizers update (#1653)

To pick up https://github.com/openvinotoolkit/openvino_tokenizers/pull/391
---
 thirdparty/openvino_tokenizers | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/thirdparty/openvino_tokenizers b/thirdparty/openvino_tokenizers
index 09c7005e0d..83dda59a58 160000
--- a/thirdparty/openvino_tokenizers
+++ b/thirdparty/openvino_tokenizers
@@ -1 +1 @@
-Subproject commit 09c7005e0da46a50cc86b0e6e4ac9b8663a7af70
+Subproject commit 83dda59a5820334ce833f7a326bfb2911fc2576a

From 40cb8498fec15be8f7e8b1475a972501235ea794 Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 17:54:19 +0400
Subject: [PATCH 14/15] DOCS: reorganized support models for image generation (#1655)

---
 SUPPORTED_MODELS.md | 80 ++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 45 deletions(-)

diff --git a/SUPPORTED_MODELS.md b/SUPPORTED_MODELS.md
index c5c55b8d73..d8e9dbe191 100644
--- a/SUPPORTED_MODELS.md
+++ b/SUPPORTED_MODELS.md
@@ -166,6 +166,7 @@ The pipeline can work with other similar topologies produced by `optimum-intel`
       <th>Architecture</th>
       <th>Text 2 image</th>
       <th>Image 2 image</th>
+      <th>Inpainting</th>
       <th>LoRA support</th>
       <th>Example HuggingFace Models</th>
@@ -174,6 +175,7 @@ The pipeline can work with other similar topologies produced by `optimum-intel`
       <td>Supported</td>
       <td>Supported</td>
       <td>Supported</td>
+      <td>Supported</td>