From 4521bb65f3f1c0dc8fed96cbcac0388210063694 Mon Sep 17 00:00:00 2001 From: Helena Kloosterman Date: Tue, 28 Jan 2025 15:35:45 +0100 Subject: [PATCH 01/15] do_sample=False for NPU in chat_sample, add NPU to README (#1637) - make chat_sample work out of the box on NPU by forcing do_sample=False for NPU - add NPU info to text_generation samples README and a small unrelated change: - change `pip install` command for exporting models that are already on huggingface-hub. No need to install all of PyTorch and transformers if you only need to download a model. --- samples/cpp/text_generation/README.md | 13 ++++++++++++- samples/python/text_generation/README.md | 13 ++++++++++++- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/samples/cpp/text_generation/README.md b/samples/cpp/text_generation/README.md index f370c74a80..dd24b6ebf5 100644 --- a/samples/cpp/text_generation/README.md +++ b/samples/cpp/text_generation/README.md @@ -19,7 +19,7 @@ optimim-cli export openvino --model ``` If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli. ```sh -pip install --upgrade-strategy eager -r ../../export-requirements.txt +pip install huggingface-hub huggingface-cli download --local-dir ``` @@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", ``` +#### NPU support + +NPU device is supported with some limitations. See [NPU inference of +LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular: + +- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model `). + For models with more than 4B parameters, channel wise quantization should be used (`--group-size -1`). +- Beam search and parallel sampling are not supported. +- Use OpenVINO 2025.0 or later (installed by deployment-requirements.txt, see "Common information" section), and the latest NPU driver. + + ### 2. Greedy Causal LM (`greedy_causal_lm`) - **Description:** Basic text generation using a causal language model. diff --git a/samples/python/text_generation/README.md b/samples/python/text_generation/README.md index 84b5302639..97a6ad59bc 100644 --- a/samples/python/text_generation/README.md +++ b/samples/python/text_generation/README.md @@ -19,7 +19,7 @@ optimim-cli export openvino --model ``` If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli. 
```sh -pip install --upgrade-strategy eager -r ../../export-requirements.txt +pip install huggingface-hub huggingface-cli download --local-dir ``` @@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", ``` +#### NPU support + +NPU device is supported with some limitations. See [NPU inference of +LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular: + +- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model `). + For models with more than 4B parameters, channel wise quantization should be used (`--group-size -1`). +- Beam search and parallel sampling are not supported. +- Use OpenVINO 2025.0 or later (installed by deployment-requirements.txt, see "Common information" section), and the latest NPU driver. + + ### 2. Greedy Causal LM (`greedy_causal_lm`) - **Description:** Basic text generation using a causal language model. From 4fb48deed27d23cbb517eb1247e8f1f48bdc8596 Mon Sep 17 00:00:00 2001 From: Vishniakov Nikolai Date: Tue, 28 Jan 2025 16:08:48 +0100 Subject: [PATCH 02/15] [JS] Add GenAI Node.js bindings (#1193) Adding Node.js bindings for GenAI pipelines. - 155187 - 158132 ## Limitations Current version it's primary backbone of future development. Supports bindings of `LLMPipeline` only. ## TODO - [x] Test build configuration - [x] Integrate unit tests run into GHA - [ ] Add script to download runtime binaries (after binaries publication) --------- Co-authored-by: Vladimir Zlobin --- .github/workflows/linux.yml | 148 ++++-- CMakeLists.txt | 5 +- cmake/features.cmake | 7 + pyproject.toml | 2 +- samples/js/text_generation/.gitignore | 1 + samples/js/text_generation/README.md | 48 ++ samples/js/text_generation/chat_sample.js | 54 ++ samples/js/text_generation/package-lock.json | 42 ++ samples/js/text_generation/package.json | 15 + .../js/text_generation/tests/usage.test.js | 62 +++ src/CMakeLists.txt | 4 + src/cpp/CMakeLists.txt | 35 +- src/js/.gitignore | 6 + src/js/.npmignore | 15 + src/js/CMakeLists.txt | 93 ++++ src/js/README.md | 56 +++ src/js/include/addon.hpp | 20 + src/js/include/helper.hpp | 23 + .../llm_pipeline/finish_chat_worker.hpp | 18 + src/js/include/llm_pipeline/init_worker.hpp | 21 + .../llm_pipeline/llm_pipeline_wrapper.hpp | 27 + .../llm_pipeline/start_chat_worker.hpp | 18 + src/js/lib/bindings.cjs | 1 + src/js/lib/module.js | 141 ++++++ src/js/package-lock.json | 470 ++++++++++++++++++ src/js/package.json | 30 ++ src/js/src/addon.cpp | 30 ++ src/js/src/helper.cpp | 53 ++ .../src/llm_pipeline/finish_chat_worker.cpp | 14 + src/js/src/llm_pipeline/init_worker.cpp | 18 + .../src/llm_pipeline/llm_pipeline_wrapper.cpp | 153 ++++++ src/js/src/llm_pipeline/start_chat_worker.cpp | 14 + src/js/tests/bindings.test.js | 58 +++ src/js/tests/models.js | 3 + src/js/tests/module.test.js | 142 ++++++ src/js/tests/setup.js | 6 + src/js/tests/utils.js | 47 ++ src/js/thirdparty/node-lib.def | 147 ++++++ src/js/thirdparty/win_delay_load_hook.cc | 52 ++ 39 files changed, 2062 insertions(+), 37 deletions(-) create mode 100644 samples/js/text_generation/.gitignore create mode 100644 samples/js/text_generation/README.md create mode 100644 
samples/js/text_generation/chat_sample.js create mode 100644 samples/js/text_generation/package-lock.json create mode 100644 samples/js/text_generation/package.json create mode 100644 samples/js/text_generation/tests/usage.test.js create mode 100644 src/js/.gitignore create mode 100644 src/js/.npmignore create mode 100644 src/js/CMakeLists.txt create mode 100644 src/js/README.md create mode 100644 src/js/include/addon.hpp create mode 100644 src/js/include/helper.hpp create mode 100644 src/js/include/llm_pipeline/finish_chat_worker.hpp create mode 100644 src/js/include/llm_pipeline/init_worker.hpp create mode 100644 src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp create mode 100644 src/js/include/llm_pipeline/start_chat_worker.hpp create mode 100644 src/js/lib/bindings.cjs create mode 100644 src/js/lib/module.js create mode 100644 src/js/package-lock.json create mode 100644 src/js/package.json create mode 100644 src/js/src/addon.cpp create mode 100644 src/js/src/helper.cpp create mode 100644 src/js/src/llm_pipeline/finish_chat_worker.cpp create mode 100644 src/js/src/llm_pipeline/init_worker.cpp create mode 100644 src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp create mode 100644 src/js/src/llm_pipeline/start_chat_worker.cpp create mode 100644 src/js/tests/bindings.test.js create mode 100644 src/js/tests/models.js create mode 100644 src/js/tests/module.test.js create mode 100644 src/js/tests/setup.js create mode 100644 src/js/tests/utils.js create mode 100644 src/js/thirdparty/node-lib.def create mode 100644 src/js/thirdparty/win_delay_load_hook.cc diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml index 98ac356e11..27b8355ce6 100644 --- a/.github/workflows/linux.yml +++ b/.github/workflows/linux.yml @@ -42,7 +42,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: 'openvinogithubactions.azurecr.io/openvino_provider:0.1.0' - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -114,11 +114,11 @@ jobs: cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ${{ env.SRC_DIR}} -B ${{ env.BUILD_DIR }} cmake --build ${{ env.BUILD_DIR}} --config ${{ matrix.build-type }} --parallel $(nproc) --verbose cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --prefix ${{ env.INSTALL_DIR }} - + - name: Pack Artifacts run: tar -cvf - * | pigz > ${{ env.BUILD_DIR }}/${{ env.GENAI_ARCHIVE_NAME }} working-directory: ${{ env.INSTALL_DIR }} - + - name: Upload Archive Distribution Package if: ${{ always() }} uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 @@ -137,7 +137,7 @@ jobs: runs-on: aks-linux-4-cores-16gb container: image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING @@ -161,7 +161,7 @@ jobs: name: ${{ needs.openvino_download.outputs.ov_artifact_name }} path: ${{ env.OV_INSTALL_DIR }} merge-multiple: true - + - name: Build Tokenizers Wheel run: | python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} \ @@ -169,7 +169,7 @@ jobs: ${{ needs.openvino_download.outputs.ov_wheel_source }} \ ${{ env.SRC_DIR }}/thirdparty/openvino_tokenizers working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build GenAI Wheel run: | python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} \ @@ -177,11 +177,11 @@ jobs: ${{ 
needs.openvino_download.outputs.ov_wheel_source }} \ ${{ env.SRC_DIR }} working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build WWB Wheel run: python -m pip wheel -v --no-deps --wheel-dir ${{ env.WHEELS_DIR }} ${{ env.SRC_DIR }}/tools/who_what_benchmark working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Upload Wheels if: ${{ always() }} uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 @@ -189,7 +189,7 @@ jobs: name: genai_wheels path: ${{ env.INSTALL_DIR }} if-no-files-found: 'error' - + genai_build_samples: name: Build Samples - ${{ matrix.build-type }} strategy: @@ -204,7 +204,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING @@ -228,17 +228,17 @@ jobs: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_archive_${{ matrix.build-type }}}" path: ${{ env.OV_INSTALL_DIR }} merge-multiple: true - + - name: Extract Artifacts run: pigz -dc ${{ env.GENAI_ARCHIVE_NAME }} | tar -xf - -C ${{ env.OV_INSTALL_DIR }} working-directory: ${{ env.OV_INSTALL_DIR }} - + - name: Build Samples (Release) if: ${{ 'Release' == matrix.build-type }} run: | chmod +x ${{ env.OV_INSTALL_DIR }}/samples/cpp/build_samples.sh ${{ env.OV_INSTALL_DIR }}/samples/cpp/build_samples.sh -i ${{ env.INSTALL_DIR }} - + - name: Build Samples (${{ matrix.build-type }}) if: ${{ 'Release' != matrix.build-type }} run: | @@ -246,7 +246,7 @@ jobs: cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ${{ env.OV_INSTALL_DIR }}/samples/cpp/ -B ${{ env.BUILD_DIR }} cmake --build ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --parallel $(nproc) cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --component samples_bin --prefix ${{ env.INSTALL_DIR }} - + - name: Pack Artifacts run: tar -cvf - * | pigz > ${{ env.INSTALL_DIR }}/${{ env.GENAI_SAMPLES_NAME }} working-directory: ${{ env.INSTALL_DIR }} @@ -258,7 +258,7 @@ jobs: name: genai_samples_${{ matrix.build-type }} path: ${{ env.INSTALL_DIR }}/*.tar.gz if-no-files-found: 'error' - + genai_tests_wheel: name: Python (${{ matrix.test.name}}) Tests (wheel) needs: [ openvino_download, genai_build_wheel ] @@ -279,7 +279,7 @@ jobs: runs-on: aks-linux-4-cores-16gb container: image: openvinogithubactions.azurecr.io/ov_test/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -289,39 +289,39 @@ jobs: BUILD_DIR: ${{ github.workspace }}/build TRANSFORMERS_CACHE: ${{ github.workspace }}/models # Hugging Face transformers cache HF_HOME: ${{ github.workspace }}/datasets # Hugging Face datasets cache - + steps: - name: Clone openvino.genai uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: path: ${{ env.SRC_DIR }} submodules: recursive - + - name: Download Build Artifacts uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 with: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_wheels}" path: ${{ env.INSTALL_DIR }} merge-multiple: true - + - name: Install GenAI Wheels uses: ./src/.github/actions/install_wheel with: packages: "openvino;openvino_tokenizers[transformers];openvino_genai;whowhatbench" requirements_files: "${{ env.SRC_DIR 
}}/tests/python_tests/requirements.txt" local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels - + - name: Tests run: python -m pytest -v ./${{ matrix.test.cmd }} working-directory: ${{ env.SRC_DIR }} - + genai_samples_tests: name: Samples Tests - ${{ matrix.build-type }} strategy: fail-fast: false matrix: build-type: [Release] - needs: [ openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples ] + needs: [ openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples ] timeout-minutes: 45 defaults: run: @@ -329,7 +329,7 @@ jobs: runs-on: aks-linux-2-cores-8gb container: image: openvinogithubactions.azurecr.io/ov_test/ubuntu_22_04_x64:${{ needs.openvino_download.outputs.docker_tag }} - volumes: + volumes: - /mount:/mount - ${{ github.workspace }}:${{ github.workspace }} @@ -338,41 +338,41 @@ jobs: SRC_DIR: ${{ github.workspace }}/src BUILD_DIR: ${{ github.workspace }}/build MODELS_DIR: ${{ github.workspace }}/models - + steps: - name: Clone openvino.genai uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: path: ${{ env.SRC_DIR }} submodules: recursive - + - name: Download Build Artifacts uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 with: pattern: "{${{ needs.openvino_download.outputs.ov_artifact_name }},genai_archive_${{ matrix.build-type }},genai_samples_${{ matrix.build-type }},genai_wheels}" path: ${{ env.INSTALL_DIR }} merge-multiple: true - + - name: Extract Artifacts run: | pigz -dc ${{ env.GENAI_ARCHIVE_NAME }} | tar -xf - -C ${{ env.INSTALL_DIR }} pigz -dc ${{ env.GENAI_SAMPLES_NAME }} | tar -xf - -C ${{ env.INSTALL_DIR }} working-directory: ${{ env.INSTALL_DIR }} - + - name: Install Wheels uses: ./src/.github/actions/install_wheel with: packages: "openvino;openvino_tokenizers[transformers];openvino_genai" requirements_files: "${{ env.SRC_DIR }}/samples/requirements.txt" local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels - + - name: Download & convert Models and data run: | mkdir -p ${{ env.MODELS_DIR }} optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 ${{ env.MODELS_DIR }}/TinyLlama-1.1B-Chat-v1.0 optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny ${{ env.MODELS_DIR }}/whisper-tiny wget https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav -O ${{ env.MODELS_DIR }}/how_are_you_doing_today.wav - + - name: Test multinomial_causal_lm.py if: ${{ 'Release' == matrix.build-type }} # Python bindings can be built in Release only timeout-minutes: 1 @@ -384,10 +384,10 @@ jobs: timeout-minutes: 1 run: ${{ env.INSTALL_DIR }}/samples/python/whisper_speech_recognition/whisper_speech_recognition.py ./whisper-tiny/ how_are_you_doing_today.wav working-directory: ${{ env.MODELS_DIR }} - + - name: C++ Tests Prerequisites run: python -m pip uninstall openvino openvino-tokenizers openvino-genai -y - + - name: Test greedy_causal_lm run: | source ${{ env.INSTALL_DIR }}/setupvars.sh @@ -400,9 +400,93 @@ jobs: ${{ env.INSTALL_DIR }}/samples_bin/whisper_speech_recognition ./whisper-tiny/ how_are_you_doing_today.wav working-directory: ${{ env.MODELS_DIR }} + genai_build_nodejs_bindings: + name: Build Node.js bindings + strategy: + fail-fast: false + matrix: + build-type: [Release] + needs: [ openvino_download ] + timeout-minutes: 20 + defaults: + run: + shell: bash + runs-on: aks-linux-4-cores-16gb + container: + image: openvinogithubactions.azurecr.io/ov_build/ubuntu_22_04_x64:${{ 
needs.openvino_download.outputs.docker_tag }} + volumes: + - /mount:/mount + options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING -v ${{ github.workspace }}:${{ github.workspace }} + env: + CMAKE_GENERATOR: Unix Makefiles + OV_INSTALL_DIR: ${{ github.workspace }}/ov + BUILD_DIR: ${{ github.workspace }}/build + SRC_DIR: ${{ github.workspace }}/src + + steps: + - name: Clone openvino.genai + uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + with: + path: ${{ env.SRC_DIR }} + submodules: recursive + + - name: Download OpenVINO package + uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8 + with: + name: ${{ needs.openvino_download.outputs.ov_artifact_name }} + path: ${{ env.OV_INSTALL_DIR }} + merge-multiple: true + + - name: Build with ENABLE_JS=ON + run: | + source ${{ env.OV_INSTALL_DIR }}/setupvars.sh + cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -DENABLE_JS=ON -S ${{ env.SRC_DIR }} -B ${{ env.BUILD_DIR }} + cmake --build ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --parallel $(nproc) --verbose + cmake --install ${{ env.BUILD_DIR }} --config ${{ matrix.build-type }} --prefix ${{ env.OV_INSTALL_DIR }} + + - name: Combine binaries for Node.js package + run: | + mkdir -p nodejs + cp -r runtime/lib/intel64/* nodejs + cp -r runtime/3rdparty/tbb/lib/* nodejs + cp genai_node_addon.node nodejs + GENAI_VERSION=$(grep -oP '(?<=CMAKE_PROJECT_VERSION:STATIC=)[^"]*' ${{ env.BUILD_DIR }}/CMakeCache.txt) + OV_VERSION=$(echo $GENAI_VERSION | sed 's/..$//') + patchelf --set-rpath '$ORIGIN' nodejs/libopenvino.so.$OV_VERSION nodejs/libopenvino_genai.so.$GENAI_VERSION + working-directory: ${{ env.OV_INSTALL_DIR }} + + - name: Pack Node.js bindings libs + run: tar -cvf - * | pigz > ${{ env.BUILD_DIR }}/genai_nodejs_bindings.tar.gz + working-directory: ${{ env.OV_INSTALL_DIR }}/nodejs + + - name: Upload Archive Package with Node.js bindings + if: ${{ always() }} + uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 + with: + name: genai_nodejs_bindings + path: ${{ env.BUILD_DIR }}/genai_nodejs_bindings.tar.gz + if-no-files-found: 'error' + + - name: Run npm package tests + working-directory: ${{ env.SRC_DIR }}/src/js + run: | + cp -R ${{ env.OV_INSTALL_DIR }}/nodejs bin + npm install + npm test + + - name: Install genai-node samples dependencies + run: npm install + working-directory: ${{ env.SRC_DIR }}/samples/js/text_generation + + - name: Run samples tests + run: npm test + env: + MODEL_PATH: ${{ env.SRC_DIR }}/src/js/tests/models/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov + working-directory: ${{ env.SRC_DIR }}/samples/js/text_generation + Overall_Status: name: ci/gha_overall_status_linux - needs: [openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples, genai_tests_wheel, genai_samples_tests] + needs: [openvino_download, genai_build_cmake, genai_build_wheel, genai_build_samples, genai_tests_wheel, genai_samples_tests, genai_build_nodejs_bindings] if: ${{ always() }} runs-on: ubuntu-latest steps: diff --git a/CMakeLists.txt b/CMakeLists.txt index bb19676da3..ee1cb70f7a 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -88,7 +88,7 @@ endif() add_subdirectory(thirdparty) add_subdirectory(src) -if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/samples") +if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/samples" AND ENABLE_SAMPLES) add_subdirectory(samples) endif() if(EXISTS "${OpenVINOGenAI_SOURCE_DIR}/tools/continuous_batching") @@ -109,6 +109,9 @@ set(CPACK_COMPONENTS_ALL core_genai 
core_genai_dev cpp_samples_genai licensing_g if(ENABLE_PYTHON) list(APPEND CPACK_COMPONENTS_ALL pygenai_${Python3_VERSION_MAJOR}_${Python3_VERSION_MINOR}) endif() +if(ENABLE_JS) + list(APPEND CPACK_COMPONENTS_ALL genai_node_addon) +endif() if(WIN32 AND NOT DEFINED CPACK_GENERATOR) set(CPACK_GENERATOR "ZIP") endif() diff --git a/cmake/features.cmake b/cmake/features.cmake index 8b2e05472b..3e494e7355 100644 --- a/cmake/features.cmake +++ b/cmake/features.cmake @@ -3,3 +3,10 @@ # option(ENABLE_PYTHON "Enable Python API build" ON) +option(ENABLE_JS "Enable JS API build" OFF) +option(ENABLE_SAMPLES "Enable samples build" ON) + +# Disable building samples for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + set(ENABLE_SAMPLES OFF) +endif() diff --git a/pyproject.toml b/pyproject.toml index c87ae38253..b54face916 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -47,7 +47,7 @@ find_python3 = true build_args = ["--parallel", "--target", "py_openvino_genai_stub"] install_args = ["--strip"] install_components = ["wheel_genai"] -options = {"BUILD_TOKENIZERS" = "OFF"} +options = {"BUILD_TOKENIZERS" = "OFF", "ENABLE_SAMPLES" = "OFF"} [build-system] requires = [ diff --git a/samples/js/text_generation/.gitignore b/samples/js/text_generation/.gitignore new file mode 100644 index 0000000000..3c3629e647 --- /dev/null +++ b/samples/js/text_generation/.gitignore @@ -0,0 +1 @@ +node_modules diff --git a/samples/js/text_generation/README.md b/samples/js/text_generation/README.md new file mode 100644 index 0000000000..46caba48e3 --- /dev/null +++ b/samples/js/text_generation/README.md @@ -0,0 +1,48 @@ +# JavaScript chat_sample that supports most popular models like LLaMA 3 + +This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `Pipeline.LLMPipeline` and configures it for the chat scenario. + +## Download and convert the model and tokenizers + +To convert the model you have to use the Python package `optimum-intel`. +The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version. + +Install [../../export-requirements.txt](../../export-requirements.txt) to convert a model. + +```sh +pip install --upgrade-strategy eager -r ../../export-requirements.txt +optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0 +``` + +## Run: + +Compile the GenAI JavaScript bindings archive first using the instructions in [../../../src/js/README.md](../../../src/js/README.md#build-bindings). + +Run `npm install` in the current folder and then run the sample: + +`node chat_sample.js TinyLlama-1.1B-Chat-v1.0` + +Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model meta-llama/Llama-2-13b-chat-hf can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU, as sketched below. + +See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.
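The device switch mentioned above is a one-line edit in the sample. A minimal sketch, assuming the `Pipeline.LLMPipeline` API and the `device` constant from `chat_sample.js` shown later in this patch; whether GPU inference works depends on the installed OpenVINO runtime and drivers:

```js
// In chat_sample.js, switch the inference device from CPU to GPU.
const device = 'GPU'; // was: const device = 'CPU'; // GPU can be used as well

// The rest of the sample stays unchanged:
const pipe = await Pipeline.LLMPipeline(MODEL_PATH, device);
```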
+ +### Troubleshooting + +#### Unicode characters encoding error on Windows + +Example error: +``` +UnicodeEncodeError: 'charmap' codec can't encode character '\u25aa' in position 0: character maps to +``` + +If you encounter the error described in the example when sample is printing output to the Windows console, it is likely due to the default Windows encoding not supporting certain Unicode characters. To resolve this: +1. Enable Unicode characters for Windows cmd - open `Region` settings from `Control panel`. `Administrative`->`Change system locale`->`Beta: Use Unicode UTF-8 for worldwide language support`->`OK`. Reboot. +2. Enable UTF-8 mode by setting environment variable `PYTHONIOENCODING="utf8"`. + +#### Missing chat template + +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model. +The following template can be used as a default, but it may not work properly with every model: +``` +"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", +``` diff --git a/samples/js/text_generation/chat_sample.js b/samples/js/text_generation/chat_sample.js new file mode 100644 index 0000000000..cf4c5e7704 --- /dev/null +++ b/samples/js/text_generation/chat_sample.js @@ -0,0 +1,54 @@ +import readline from 'readline'; +import { Pipeline } from 'genai-node'; + +main(); + +function streamer(subword) { + process.stdout.write(subword); +} + +async function main() { + const MODEL_PATH = process.argv[2]; + + if (!MODEL_PATH) { + console.error('Please specify path to model directory\n' + + 'Run command must be: `node chat_sample.js *path_to_model_dir*`'); + process.exit(1); + } + + const device = 'CPU'; // GPU can be used as well + + // Create interface for reading user input from stdin + const rl = readline.createInterface({ + input: process.stdin, + output: process.stdout, + }); + + const pipe = await Pipeline.LLMPipeline(MODEL_PATH, device); + const config = { 'max_new_tokens': 100 }; + + await pipe.startChat(); + promptUser(); + + // Function to prompt the user for input + function promptUser() { + rl.question('question:\n', handleInput); + } + + // Function to handle user input + async function handleInput(input) { + input = input.trim(); + + // Check for exit command + if (!input) { + await pipe.finishChat(); + rl.close(); + process.exit(0); + } + + await pipe.generate(input, config, streamer); + console.log('\n----------'); + + if (!rl.closed) promptUser(); + } +} diff --git a/samples/js/text_generation/package-lock.json b/samples/js/text_generation/package-lock.json new file mode 100644 index 0000000000..fbee0db012 --- /dev/null +++ b/samples/js/text_generation/package-lock.json @@ -0,0 +1,42 @@ +{ + "name": "genai-node-demo", + "version": "1.0.0", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "genai-node-demo", + "version": "1.0.0", + "license": "Apache-2.0", + "devDependencies": { + "genai-node": "../../../src/js/" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "../../../src/js": { + "name": "genai-node", + "version": "2024.5.0-preview", + "dev": true, + "license": "Apache-2.0", + "os": [ + "linux", + "darwin", + "win32" + 
], + "devDependencies": { + "@huggingface/hub": "^0.21.0", + "global-agent": "^3.0.0", + "node-fetch": "^3.3.2" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "node_modules/genai-node": { + "resolved": "../../../src/js", + "link": true + } + } +} diff --git a/samples/js/text_generation/package.json b/samples/js/text_generation/package.json new file mode 100644 index 0000000000..24e66a120d --- /dev/null +++ b/samples/js/text_generation/package.json @@ -0,0 +1,15 @@ +{ + "name": "genai-node-demo", + "version": "1.0.0", + "license": "Apache-2.0", + "type": "module", + "devDependencies": { + "genai-node": "../../../src/js/" + }, + "engines": { + "node": ">=21.0.0" + }, + "scripts": { + "test": "node tests/usage.test.js" + } +} diff --git a/samples/js/text_generation/tests/usage.test.js b/samples/js/text_generation/tests/usage.test.js new file mode 100644 index 0000000000..fcd58a0b69 --- /dev/null +++ b/samples/js/text_generation/tests/usage.test.js @@ -0,0 +1,62 @@ +import { env } from 'process'; +import { spawn } from 'child_process'; + +const MODEL_PATH = env.MODEL_PATH; +const prompt = 'Tell me exactly, no changes, print as is: "Hello world"'; + +if (!MODEL_PATH) + throw new Error( + 'Please environment variable MODEL_PATH to the path of the model directory' + ); + +const runTest = async () => { + return new Promise((resolve, reject) => { + const script = spawn('node', ['chat_sample.js', MODEL_PATH]); + let output = ''; + + // Collect output from stdout + script.stdout.on('data', (data) => { + output += data.toString(); + }); + + // Capture errors + script.stderr.on('data', (data) => { + reject(data.toString()); + }); + + // Send input after detecting the question prompt + script.stdout.once('data', (data) => { + if (data.toString().startsWith('question:')) { + script.stdin.write(`${prompt}\n`); // Provide input + script.stdin.end(); // Close stdin to signal EOF + } + }); + + // Check results when the process exits + script.on('close', (code) => { + if (code !== 0) { + return reject(`Process exited with code ${code}`); + } + + // Log the output + console.log(`Result output: ${output}`); + + // Validate the output + if (typeof output == 'string' && output.length > 0) { + resolve('Test passed!'); + } else { + reject('Test failed: Output did not match expected result.'); + } + }); + }); +}; + +runTest() + .then((message) => { + console.log(message); + process.exit(0); + }) + .catch((err) => { + console.error(err); + process.exit(1); + }); diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index 2f615a1b6f..c2ef969838 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -7,3 +7,7 @@ add_subdirectory(cpp) if(ENABLE_PYTHON) add_subdirectory(python) endif() + +if(ENABLE_JS) + add_subdirectory(js) +endif() diff --git a/src/cpp/CMakeLists.txt b/src/cpp/CMakeLists.txt index 43bca747ec..5c50f55268 100644 --- a/src/cpp/CMakeLists.txt +++ b/src/cpp/CMakeLists.txt @@ -147,13 +147,42 @@ if(MSVC OR APPLE) set(ARCH_DIR ${ARCH_DIR}/${CMAKE_BUILD_TYPE}) endif() +# Put binaries at the top level for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + set(LIBRARY_DESTINATION .) + set(ARCHIVE_DESTINATION .) + set(RUNTIME_DESTINATION .) 
+ + # setting RPATH / LC_RPATH depending on platform + if(LINUX) + # to find libopenvino.so in the same folder + set(rpaths "$ORIGIN") + elseif(APPLE) + # to find libopenvino.dylib in the same folder + set(rpaths "@loader_path") + endif() + + if(rpaths) + set_target_properties(${TARGET_NAME} PROPERTIES INSTALL_RPATH "${rpaths}") + endif() +else() + set(LIBRARY_DESTINATION runtime/lib/${ARCH_DIR}) + set(ARCHIVE_DESTINATION runtime/lib/${ARCH_DIR}) + set(RUNTIME_DESTINATION runtime/bin/${ARCH_DIR}) +endif() + install(TARGETS ${TARGET_NAME} EXPORT OpenVINOGenAITargets - LIBRARY DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai + LIBRARY DESTINATION ${LIBRARY_DESTINATION} COMPONENT core_genai NAMELINK_COMPONENT core_genai_dev - ARCHIVE DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai_dev - RUNTIME DESTINATION runtime/bin/${ARCH_DIR} COMPONENT core_genai + ARCHIVE DESTINATION ${ARCHIVE_DESTINATION} COMPONENT core_genai_dev + RUNTIME DESTINATION ${RUNTIME_DESTINATION} COMPONENT core_genai INCLUDES DESTINATION runtime/include) +# development files do not need to be built for NPM package +if(CPACK_GENERATOR STREQUAL "NPM") + return() +endif() + install(DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/include/ DESTINATION runtime/include COMPONENT core_genai_dev) install(FILES ${CMAKE_CURRENT_BINARY_DIR}/openvino/genai/version.hpp diff --git a/src/js/.gitignore b/src/js/.gitignore new file mode 100644 index 0000000000..8990d8c418 --- /dev/null +++ b/src/js/.gitignore @@ -0,0 +1,6 @@ +.vscode +bin +bin.* +build +node_modules +tests/models diff --git a/src/js/.npmignore b/src/js/.npmignore new file mode 100644 index 0000000000..9bf3e571b1 --- /dev/null +++ b/src/js/.npmignore @@ -0,0 +1,15 @@ +.vscode +bin.* +build +include +src +tests + +.eslintrc.js +CMakeLists.txt +tsconfig.json +TODO.md +build.sh + +**/*.tsbuildinfo +*.tgz diff --git a/src/js/CMakeLists.txt b/src/js/CMakeLists.txt new file mode 100644 index 0000000000..7e4ff0bea4 --- /dev/null +++ b/src/js/CMakeLists.txt @@ -0,0 +1,93 @@ +cmake_minimum_required(VERSION 3.18) + +# Set C++ standard +set(CMAKE_CXX_STANDARD 17) + +set(dist_folder "${CMAKE_SOURCE_DIR}/bin/") + +if(WIN32) + set(CMAKE_SHARED_LINKER_FLAGS /DELAYLOAD:NODE.EXE) + set(CMAKE_JS_LIB ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/node.lib) + set(CMAKE_JS_SRC ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/win_delay_load_hook.cc) + + set(CMAKE_JS_NODELIB_DEF ${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/node-lib.def) + set(CMAKE_JS_NODELIB_TARGET ${CMAKE_JS_LIB}) + set(DELAYIMP_LIB delayimp.lib) +endif() + +project(genai_node_addon) + +# Specify NAPI version 8 +# supports v12.22.0+, v14.17.0+, v15.12.0+, 16.0.0 and all later Node.js versions +add_definitions(-DNAPI_VERSION=8) + +include(FetchContent) + +FetchContent_Declare( + node-api-headers + URL https://github.com/nodejs/node-api-headers/archive/refs/tags/v1.1.0.tar.gz + URL_HASH SHA256=70608bc1e6dddce280285f3462f18a106f687c0720a4b90893e1ecd86e5a8bbf +) +FetchContent_MakeAvailable(node-api-headers) + +FetchContent_Declare( + node-addon-api + URL https://github.com/nodejs/node-addon-api/archive/refs/tags/v8.0.0.tar.gz + URL_HASH SHA256=42424c5206b9d67b41af4fcff5d6e3cb22074168035a03b8467852938a281d47 +) +FetchContent_MakeAvailable(node-addon-api) + +# Create a library +add_library(${PROJECT_NAME} SHARED + ${CMAKE_CURRENT_SOURCE_DIR}/src/addon.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/llm_pipeline_wrapper.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/finish_chat_worker.cpp + 
${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/start_chat_worker.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/llm_pipeline/init_worker.cpp + ${CMAKE_CURRENT_SOURCE_DIR}/src/helper.cpp + + ${CMAKE_JS_SRC} +) + +# Include directories +target_include_directories(${PROJECT_NAME} PRIVATE + "${node-api-headers_SOURCE_DIR}/include" + "${node-addon-api_SOURCE_DIR}" + "${CMAKE_CURRENT_SOURCE_DIR}" +) + +target_link_libraries(${PROJECT_NAME} PRIVATE openvino::genai ${DELAYIMP_LIB} ${CMAKE_JS_LIB}) + +if(MSVC AND CMAKE_JS_NODELIB_DEF AND CMAKE_JS_NODELIB_TARGET) # Generate node.lib + execute_process(COMMAND ${CMAKE_AR} /def:${CMAKE_JS_NODELIB_DEF} /out:${CMAKE_JS_NODELIB_TARGET} ${CMAKE_STATIC_LINKER_FLAGS}) +endif() + +if(APPLE) + target_link_options(${PROJECT_NAME} PRIVATE -Wl,-undefined,suppress,-flat_namespace) +elseif(AARCH64 OR ARM) + target_link_options(${PROJECT_NAME} PRIVATE -Wl,--unresolved-symbols=ignore-all) +endif() + +# Set library properties +set_target_properties(${PROJECT_NAME} PROPERTIES + PREFIX "" + SUFFIX ".node" +) + +# setting RPATH / LC_RPATH depending on platform +if(LINUX) + # to find libopenvino_genai.so in the same folder + set(rpaths "$ORIGIN") +elseif(APPLE) + # to find libopenvino_genai.dylib in the same folder + set(rpaths "@loader_path") +endif() + +if(rpaths) + set_target_properties(${PROJECT_NAME} PROPERTIES INSTALL_RPATH "${rpaths}") +endif() + +install(TARGETS ${PROJECT_NAME} + LIBRARY DESTINATION . COMPONENT ${PROJECT_NAME} + RUNTIME DESTINATION . COMPONENT ${PROJECT_NAME} +) diff --git a/src/js/README.md b/src/js/README.md new file mode 100644 index 0000000000..f5ccf1117c --- /dev/null +++ b/src/js/README.md @@ -0,0 +1,56 @@ +# OpenVINO™ GenAI Node.js bindings (preview) + +## DISCLAIMER + +This is a preview version; do not use it in production! + +## Install and Run + +### Requirements + +- Node.js v21+ +- Tested on Ubuntu; other operating systems have not been tested yet + +### Build Bindings + +#### Build OpenVINO GenAI as OpenVINO Extra Module + +OpenVINO GenAI Node.js bindings can be built as an extra module during the OpenVINO build process. This method simplifies the build process by integrating OpenVINO GenAI directly into the OpenVINO build. + +1. Clone the OpenVINO repository: + ```sh + git clone --recursive https://github.com/openvinotoolkit/openvino.git + ``` +1. Configure CMake with OpenVINO extra modules: + ```sh + cmake -DOPENVINO_EXTRA_MODULES=*path to genai repository directory* -DCPACK_ARCHIVE_COMPONENT_INSTALL=OFF \ + -DCPACK_GENERATOR=NPM \ + -DENABLE_PYTHON=OFF \ + -DENABLE_WHEEL=OFF \ + -DCPACK_PACKAGE_FILE_NAME=genai_nodejs_bindings \ + -S ./openvino -B ./build + ``` +1. Build the OpenVINO archive with GenAI: + ```sh + cmake --build ./build --target package -j + ``` + +1. Put the Node.js bindings into the npm package `bin` directory and install dependencies: + ```sh + mkdir ./src/js/bin/ + tar -xvf ./build/genai_nodejs_bindings.tar.gz --directory ./src/js/bin/ + cd ./src/js/ + npm install + ``` +1. Run tests to be sure that everything works: + ```sh + npm test + ``` + +### Using as npm Dependency + +To use this package locally, run `npm link` in the `src/js/` directory +and `npm link genai-node` in the folder where you want to add this package as a dependency. + +To pack this package for distribution as a regular npm package, run `npm pack`. +This command creates an archive that you may use in your projects.
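Once linked (or packed and installed), the package exposes the chat API implemented in `lib/module.js`. A minimal usage sketch, assuming a converted model directory `./TinyLlama-1.1B-Chat-v1.0` and the `Pipeline.LLMPipeline`, `startChat`, `generate`, and `finishChat` methods added by this patch:

```js
import { Pipeline } from 'genai-node';

// Model path and device are assumptions; any converted OpenVINO IR LLM works.
const pipe = await Pipeline.LLMPipeline('./TinyLlama-1.1B-Chat-v1.0', 'CPU');

await pipe.startChat();
// The third argument is an optional streamer callback receiving generated subwords.
const answer = await pipe.generate(
  'What is OpenVINO?',
  { max_new_tokens: 100 },
  (subword) => process.stdout.write(subword)
);
await pipe.finishChat();
console.log(`\nFull answer: ${answer}`);
```

The callback mirrors the `streamer` function used by `samples/js/text_generation/chat_sample.js` in this patch.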
diff --git a/src/js/include/addon.hpp b/src/js/include/addon.hpp new file mode 100644 index 0000000000..35e5cc462e --- /dev/null +++ b/src/js/include/addon.hpp @@ -0,0 +1,20 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#pragma once + +#include + +typedef Napi::Function (*Prototype)(Napi::Env); + +struct AddonData { + Napi::FunctionReference core; +}; + +void init_class(Napi::Env env, + Napi::Object exports, + std::string class_name, + Prototype func, + Napi::FunctionReference& reference); + +Napi::Object init_module(Napi::Env env, Napi::Object exports); diff --git a/src/js/include/helper.hpp b/src/js/include/helper.hpp new file mode 100644 index 0000000000..4a010df019 --- /dev/null +++ b/src/js/include/helper.hpp @@ -0,0 +1,23 @@ +#pragma once +#include + +#include "openvino/core/type/element_type.hpp" +#include "openvino/openvino.hpp" + +ov::AnyMap to_anyMap(const Napi::Env&, const Napi::Value&); + +/** + * @brief Template function to convert Javascript data types into C++ data types + * @tparam TargetType destinated C++ data type + * @param info Napi::CallbackInfo contains all arguments passed to a function or method + * @param idx specifies index of a argument inside info. + * @return specified argument converted to a TargetType. + */ +template +TargetType js_to_cpp(const Napi::Env& env, const Napi::Value& value); + +/** @brief A template specialization for TargetType ov::Any */ +template <> +ov::Any js_to_cpp(const Napi::Env& env, const Napi::Value& value); + +bool is_napi_value_int(const Napi::Env& env, const Napi::Value& num); diff --git a/src/js/include/llm_pipeline/finish_chat_worker.hpp b/src/js/include/llm_pipeline/finish_chat_worker.hpp new file mode 100644 index 0000000000..ca80b30aff --- /dev/null +++ b/src/js/include/llm_pipeline/finish_chat_worker.hpp @@ -0,0 +1,18 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class FinishChatWorker : public AsyncWorker { + public: + FinishChatWorker(Function& callback, std::shared_ptr& pipe); + virtual ~FinishChatWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; +}; diff --git a/src/js/include/llm_pipeline/init_worker.hpp b/src/js/include/llm_pipeline/init_worker.hpp new file mode 100644 index 0000000000..5fc05969fb --- /dev/null +++ b/src/js/include/llm_pipeline/init_worker.hpp @@ -0,0 +1,21 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class InitWorker : public AsyncWorker { + public: + InitWorker(Function& callback, std::shared_ptr& pipe, + const std::string model_path, std::string device); + virtual ~InitWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; + std::string model_path; + std::string device; +}; diff --git a/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp b/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp new file mode 100644 index 0000000000..872e9ea023 --- /dev/null +++ b/src/js/include/llm_pipeline/llm_pipeline_wrapper.hpp @@ -0,0 +1,27 @@ +#pragma once + +#include +#include +#include "openvino/genai/llm_pipeline.hpp" + +class LLMPipelineWrapper : public Napi::ObjectWrap { +public: + LLMPipelineWrapper(const Napi::CallbackInfo& info); + + static Napi::Function get_class(Napi::Env env); + + Napi::Value init(const Napi::CallbackInfo& info); + Napi::Value generate(const Napi::CallbackInfo& info); + Napi::Value start_chat(const Napi::CallbackInfo& info); + Napi::Value finish_chat(const 
Napi::CallbackInfo& info); +private: + bool is_loaded = false; + bool is_initialized = false; + bool is_running = false; + + std::string model_path; + std::string device; + + std::shared_ptr pipe = nullptr; + std::function streamer; +}; diff --git a/src/js/include/llm_pipeline/start_chat_worker.hpp b/src/js/include/llm_pipeline/start_chat_worker.hpp new file mode 100644 index 0000000000..fde0cfaa0a --- /dev/null +++ b/src/js/include/llm_pipeline/start_chat_worker.hpp @@ -0,0 +1,18 @@ +#pragma once + +#include +#include "openvino/genai/llm_pipeline.hpp" + +using namespace Napi; + +class StartChatWorker : public AsyncWorker { + public: + StartChatWorker(Function& callback, std::shared_ptr& pipe); + virtual ~StartChatWorker(){}; + + void Execute(); + void OnOK(); + + private: + std::shared_ptr& pipe; +}; diff --git a/src/js/lib/bindings.cjs b/src/js/lib/bindings.cjs new file mode 100644 index 0000000000..acd9e590b8 --- /dev/null +++ b/src/js/lib/bindings.cjs @@ -0,0 +1 @@ +module.exports = require('../bin/genai_node_addon.node'); diff --git a/src/js/lib/module.js b/src/js/lib/module.js new file mode 100644 index 0000000000..6595ba0de0 --- /dev/null +++ b/src/js/lib/module.js @@ -0,0 +1,141 @@ +import util from 'node:util'; + +import addon from './bindings.cjs'; + +class LLMPipeline { + modelPath = null; + device = null; + pipeline = null; + isInitialized = false; + isChatStarted = false; + + constructor(modelPath, device) { + this.modelPath = modelPath; + this.device = device; + } + + async init() { + if (this.isInitialized) + throw new Error('Pipeline is already initialized'); + + this.pipeline = new addon.LLMPipeline(); + + const init = util.promisify(this.pipeline.init.bind(this.pipeline)); + const result = await init(this.modelPath, this.device); + + this.isInitialized = true; + + return result; + } + + async startChat() { + if (this.isChatStarted) + throw new Error('Chat is already started'); + + const startChatPromise = util.promisify( + this.pipeline.startChat.bind(this.pipeline) + ); + const result = await startChatPromise(); + + this.isChatStarted = true; + + return result; + } + async finishChat() { + if (!this.isChatStarted) + throw new Error('Chat is not started'); + + const finishChatPromise = util.promisify( + this.pipeline.finishChat.bind(this.pipeline) + ); + const result = await finishChatPromise(); + + this.isChatStarted = false; + + return result; + } + + static castOptionsToString(options) { + const castedOptions = {}; + + for (const key in options) + castedOptions[key] = String(options[key]); + + return castedOptions; + } + + getAsyncGenerator(prompt, generationOptions = {}) { + if (!this.isInitialized) + throw new Error('Pipeline is not initialized'); + + if (typeof prompt !== 'string') + throw new Error('Prompt must be a string'); + if (typeof generationOptions !== 'object') + throw new Error('Options must be an object'); + + const castedOptions = LLMPipeline.castOptionsToString(generationOptions); + + const queue = []; + let resolvePromise; + + // Callback function that C++ will call when a chunk is ready + function chunkOutput(isDone, subword) { + if (resolvePromise) { + resolvePromise({ value: subword, done: isDone }); // Fulfill pending request + resolvePromise = null; // Reset promise resolver + } else { + queue.push({ isDone, subword }); // Add data to queue if no pending promise + } + } + + this.pipeline.generate(prompt, chunkOutput, castedOptions); + + return { + async next() { + // If there is data in the queue, return it + // Otherwise, return a promise that 
will resolve when data is available + if (queue.length > 0) { + const { isDone, subword } = queue.shift(); + + return { value: subword, done: isDone }; + } + + return new Promise((resolve) => (resolvePromise = resolve)); + }, + [Symbol.asyncIterator]() { return this; } + }; + } + + async generate(prompt, generationOptions, generationCallback) { + const options = generationOptions || {}; + + if (generationCallback !== undefined && typeof generationCallback !== 'function') + throw new Error('Generation callback must be a function'); + + const g = this.getAsyncGenerator(prompt, options); + const result = []; + + for await (const chunk of g) { + result.push(chunk); + + if (generationCallback) generationCallback(chunk); + } + + return result.join(''); + } +} + +class Pipeline { + static async LLMPipeline(modelPath, device = 'CPU') { + const pipeline = new LLMPipeline(modelPath, device); + await pipeline.init(); + + return pipeline; + } +} + + +export { + addon, + Pipeline, +}; diff --git a/src/js/package-lock.json b/src/js/package-lock.json new file mode 100644 index 0000000000..4da5b57ea7 --- /dev/null +++ b/src/js/package-lock.json @@ -0,0 +1,470 @@ +{ + "name": "genai-node", + "version": "2024.5.0-preview", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "genai-node", + "version": "2024.5.0-preview", + "license": "Apache-2.0", + "os": [ + "linux", + "darwin", + "win32" + ], + "devDependencies": { + "@huggingface/hub": "^0.21.0", + "global-agent": "^3.0.0", + "node-fetch": "^3.3.2" + }, + "engines": { + "node": ">=21.0.0" + } + }, + "node_modules/@huggingface/hub": { + "version": "0.21.0", + "resolved": "https://registry.npmjs.org/@huggingface/hub/-/hub-0.21.0.tgz", + "integrity": "sha512-DpitNhqobMJLTv8dUq/EMtrz1dpfs3UrSVCxe1aKpjLAdOs6Gm6rqrinUFNvC9G88RIRzIYzojUtYUqlkKwKnA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@huggingface/tasks": "^0.13.3" + }, + "engines": { + "node": ">=18" + } + }, + "node_modules/@huggingface/tasks": { + "version": "0.13.4", + "resolved": "https://registry.npmjs.org/@huggingface/tasks/-/tasks-0.13.4.tgz", + "integrity": "sha512-LETHbMSK3gHBFU0D09ziEJm6t1Pcgii4SFwHw+d+8MFGfkAryxaDl2qaHY+PxiTkZEeaTLd6G8/239SJuVxyWg==", + "dev": true, + "license": "MIT" + }, + "node_modules/boolean": { + "version": "3.2.0", + "resolved": "https://registry.npmjs.org/boolean/-/boolean-3.2.0.tgz", + "integrity": "sha512-d0II/GO9uf9lfUHH2BQsjxzRJZBdsjgsBiW4BvhWk/3qoKwQFjIDVN19PfX8F2D/r9PCMTtLWjYVCFrpeYUzsw==", + "deprecated": "Package no longer supported. 
Contact Support at https://www.npmjs.com/support for more info.", + "dev": true, + "license": "MIT" + }, + "node_modules/data-uri-to-buffer": { + "version": "4.0.1", + "resolved": "https://registry.npmjs.org/data-uri-to-buffer/-/data-uri-to-buffer-4.0.1.tgz", + "integrity": "sha512-0R9ikRb668HB7QDxT1vkpuUBtqc53YyAwMwGeUFKRojY/NWKvdZ+9UYtRfGmhqNbRkTSVpMbmyhXipFFv2cb/A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 12" + } + }, + "node_modules/define-data-property": { + "version": "1.1.4", + "resolved": "https://registry.npmjs.org/define-data-property/-/define-data-property-1.1.4.tgz", + "integrity": "sha512-rBMvIzlpA8v6E+SJZoo++HAYqsLrkg7MSfIinMPFhmkorw7X+dOXVJQs+QT69zGkzMyfDnIMN2Wid1+NbL3T+A==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-define-property": "^1.0.0", + "es-errors": "^1.3.0", + "gopd": "^1.0.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/define-properties": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/define-properties/-/define-properties-1.2.1.tgz", + "integrity": "sha512-8QmQKqEASLd5nx0U1B1okLElbUuuttJ/AnYmRXbbbGDWh6uS208EjD4Xqq/I9wK7u0v6O08XhTWnt5XtEbR6Dg==", + "dev": true, + "license": "MIT", + "dependencies": { + "define-data-property": "^1.0.1", + "has-property-descriptors": "^1.0.0", + "object-keys": "^1.1.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/detect-node": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/detect-node/-/detect-node-2.1.0.tgz", + "integrity": "sha512-T0NIuQpnTvFDATNuHN5roPwSBG83rFsuO+MXXH9/3N1eFbn4wcPjttvjMLEPWJ0RGUYgQE7cGgS3tNxbqCGM7g==", + "dev": true, + "license": "MIT" + }, + "node_modules/es-define-property": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/es-define-property/-/es-define-property-1.0.0.tgz", + "integrity": "sha512-jxayLKShrEqqzJ0eumQbVhTYQM27CfT1T35+gCgDFoL82JLsXqTJ76zv6A0YLOgEnLUMvLzsDsGIrl8NFpT2gQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "get-intrinsic": "^1.2.4" + }, + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/es-errors": { + "version": "1.3.0", + "resolved": "https://registry.npmjs.org/es-errors/-/es-errors-1.3.0.tgz", + "integrity": "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/es6-error": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/es6-error/-/es6-error-4.1.1.tgz", + "integrity": "sha512-Um/+FxMr9CISWh0bi5Zv0iOD+4cFh5qLeks1qhAopKVAJw3drgKbKySikp7wGhDL0HPeaja0P5ULZrxLkniUVg==", + "dev": true, + "license": "MIT" + }, + "node_modules/escape-string-regexp": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/escape-string-regexp/-/escape-string-regexp-4.0.0.tgz", + "integrity": "sha512-TtpcNJ3XAzx3Gq8sWRzJaVajRs0uVxA2YAkdb1jm2YkPz4G6egUFAyA3n5vtEIZefPk5Wa4UXbKuS5fKkJWdgA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/fetch-blob": { + "version": "3.2.0", + "resolved": "https://registry.npmjs.org/fetch-blob/-/fetch-blob-3.2.0.tgz", + "integrity": "sha512-7yAQpD2UMJzLi1Dqv7qFYnPbaPx7ZfFK6PiIxQ4PfkGPyNyl2Ugx+a/umUonmKqjhM4DnfbMvdX6otXq83soQQ==", + "dev": true, + "funding": [ + { + "type": "github", + "url": 
"https://github.com/sponsors/jimmywarting" + }, + { + "type": "paypal", + "url": "https://paypal.me/jimmywarting" + } + ], + "license": "MIT", + "dependencies": { + "node-domexception": "^1.0.0", + "web-streams-polyfill": "^3.0.3" + }, + "engines": { + "node": "^12.20 || >= 14.13" + } + }, + "node_modules/formdata-polyfill": { + "version": "4.0.10", + "resolved": "https://registry.npmjs.org/formdata-polyfill/-/formdata-polyfill-4.0.10.tgz", + "integrity": "sha512-buewHzMvYL29jdeQTVILecSaZKnt/RJWjoZCF5OW60Z67/GmSLBkOFM7qh1PI3zFNtJbaZL5eQu1vLfazOwj4g==", + "dev": true, + "license": "MIT", + "dependencies": { + "fetch-blob": "^3.1.2" + }, + "engines": { + "node": ">=12.20.0" + } + }, + "node_modules/function-bind": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/function-bind/-/function-bind-1.1.2.tgz", + "integrity": "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA==", + "dev": true, + "license": "MIT", + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/get-intrinsic": { + "version": "1.2.4", + "resolved": "https://registry.npmjs.org/get-intrinsic/-/get-intrinsic-1.2.4.tgz", + "integrity": "sha512-5uYhsJH8VJBTv7oslg4BznJYhDoRI6waYCxMmCdnTrcCrHA/fCFKoTFz2JKKE0HdDFUF7/oQuhzumXJK7paBRQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-errors": "^1.3.0", + "function-bind": "^1.1.2", + "has-proto": "^1.0.1", + "has-symbols": "^1.0.3", + "hasown": "^2.0.0" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/global-agent": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/global-agent/-/global-agent-3.0.0.tgz", + "integrity": "sha512-PT6XReJ+D07JvGoxQMkT6qji/jVNfX/h364XHZOWeRzy64sSFr+xJ5OX7LI3b4MPQzdL4H8Y8M0xzPpsVMwA8Q==", + "dev": true, + "license": "BSD-3-Clause", + "dependencies": { + "boolean": "^3.0.1", + "es6-error": "^4.1.1", + "matcher": "^3.0.0", + "roarr": "^2.15.3", + "semver": "^7.3.2", + "serialize-error": "^7.0.1" + }, + "engines": { + "node": ">=10.0" + } + }, + "node_modules/globalthis": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/globalthis/-/globalthis-1.0.4.tgz", + "integrity": "sha512-DpLKbNU4WylpxJykQujfCcwYWiV/Jhm50Goo0wrVILAv5jOr9d+H+UR3PhSCD2rCCEIg0uc+G+muBTwD54JhDQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "define-properties": "^1.2.1", + "gopd": "^1.0.1" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/gopd": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/gopd/-/gopd-1.0.1.tgz", + "integrity": "sha512-d65bNlIadxvpb/A2abVdlqKqV563juRnZ1Wtk6s1sIR8uNsXR70xqIzVqxVf1eTqDunwT2MkczEeaezCKTZhwA==", + "dev": true, + "license": "MIT", + "dependencies": { + "get-intrinsic": "^1.1.3" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-property-descriptors": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/has-property-descriptors/-/has-property-descriptors-1.0.2.tgz", + "integrity": "sha512-55JNKuIW+vq4Ke1BjOTjM2YctQIvCT7GFzHwmfZPGo5wnrgkid0YQtnAleFSqumZm4az3n2BS+erby5ipJdgrg==", + "dev": true, + "license": "MIT", + "dependencies": { + "es-define-property": "^1.0.0" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-proto": { + "version": "1.0.3", + "resolved": 
"https://registry.npmjs.org/has-proto/-/has-proto-1.0.3.tgz", + "integrity": "sha512-SJ1amZAJUiZS+PhsVLf5tGydlaVB8EdFpaSO4gmiUKUOxk8qzn5AIy4ZeJUmh22znIdk/uMAUT2pl3FxzVUH+Q==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/has-symbols": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/has-symbols/-/has-symbols-1.0.3.tgz", + "integrity": "sha512-l3LCuF6MgDNwTDKkdYGEihYjt5pRPbEg46rtlmnSPlUbgmB8LOIrKJbYYFBSbnPaJexMKtiPO8hmeRjRz2Td+A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/hasown": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/hasown/-/hasown-2.0.2.tgz", + "integrity": "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "function-bind": "^1.1.2" + }, + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/json-stringify-safe": { + "version": "5.0.1", + "resolved": "https://registry.npmjs.org/json-stringify-safe/-/json-stringify-safe-5.0.1.tgz", + "integrity": "sha512-ZClg6AaYvamvYEE82d3Iyd3vSSIjQ+odgjaTzRuO3s7toCdFKczob2i0zCh7JE8kWn17yvAWhUVxvqGwUalsRA==", + "dev": true, + "license": "ISC" + }, + "node_modules/matcher": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/matcher/-/matcher-3.0.0.tgz", + "integrity": "sha512-OkeDaAZ/bQCxeFAozM55PKcKU0yJMPGifLwV4Qgjitu+5MoAfSQN4lsLJeXZ1b8w0x+/Emda6MZgXS1jvsapng==", + "dev": true, + "license": "MIT", + "dependencies": { + "escape-string-regexp": "^4.0.0" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/node-domexception": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/node-domexception/-/node-domexception-1.0.0.tgz", + "integrity": "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==", + "dev": true, + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/jimmywarting" + }, + { + "type": "github", + "url": "https://paypal.me/jimmywarting" + } + ], + "license": "MIT", + "engines": { + "node": ">=10.5.0" + } + }, + "node_modules/node-fetch": { + "version": "3.3.2", + "resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-3.3.2.tgz", + "integrity": "sha512-dRB78srN/l6gqWulah9SrxeYnxeddIG30+GOqK/9OlLVyLg3HPnr6SqOWTWOXKRwC2eGYCkZ59NNuSgvSrpgOA==", + "dev": true, + "license": "MIT", + "dependencies": { + "data-uri-to-buffer": "^4.0.0", + "fetch-blob": "^3.1.4", + "formdata-polyfill": "^4.0.10" + }, + "engines": { + "node": "^12.20.0 || ^14.13.1 || >=16.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/node-fetch" + } + }, + "node_modules/object-keys": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/object-keys/-/object-keys-1.1.1.tgz", + "integrity": "sha512-NuAESUOUMrlIXOfHKzD6bpPu3tYt3xvjNdRIQ+FeT0lNb4K8WR70CaDxhuNguS2XG+GjkyMwOzsN5ZktImfhLA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/roarr": { + "version": "2.15.4", + "resolved": "https://registry.npmjs.org/roarr/-/roarr-2.15.4.tgz", + "integrity": "sha512-CHhPh+UNHD2GTXNYhPWLnU8ONHdI+5DI+4EYIAOaiD63rHeYlZvyh8P+in5999TTSFgUYuKUAjzRI4mdh/p+2A==", + "dev": true, + "license": "BSD-3-Clause", + "dependencies": { + "boolean": "^3.0.1", + "detect-node": "^2.0.4", + "globalthis": 
"^1.0.1", + "json-stringify-safe": "^5.0.1", + "semver-compare": "^1.0.0", + "sprintf-js": "^1.1.2" + }, + "engines": { + "node": ">=8.0" + } + }, + "node_modules/semver": { + "version": "7.6.3", + "resolved": "https://registry.npmjs.org/semver/-/semver-7.6.3.tgz", + "integrity": "sha512-oVekP1cKtI+CTDvHWYFUcMtsK/00wmAEfyqKfNdARm8u1wNVhSgaX7A8d4UuIlUI5e84iEwOhs7ZPYRmzU9U6A==", + "dev": true, + "license": "ISC", + "bin": { + "semver": "bin/semver.js" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/semver-compare": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/semver-compare/-/semver-compare-1.0.0.tgz", + "integrity": "sha512-YM3/ITh2MJ5MtzaM429anh+x2jiLVjqILF4m4oyQB18W7Ggea7BfqdH/wGMK7dDiMghv/6WG7znWMwUDzJiXow==", + "dev": true, + "license": "MIT" + }, + "node_modules/serialize-error": { + "version": "7.0.1", + "resolved": "https://registry.npmjs.org/serialize-error/-/serialize-error-7.0.1.tgz", + "integrity": "sha512-8I8TjW5KMOKsZQTvoxjuSIa7foAwPWGOts+6o7sgjz41/qMD9VQHEDxi6PBvK2l0MXUmqZyNpUK+T2tQaaElvw==", + "dev": true, + "license": "MIT", + "dependencies": { + "type-fest": "^0.13.1" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/sprintf-js": { + "version": "1.1.3", + "resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.1.3.tgz", + "integrity": "sha512-Oo+0REFV59/rz3gfJNKQiBlwfHaSESl1pcGyABQsnnIfWOFt6JNj5gCog2U6MLZ//IGYD+nA8nI+mTShREReaA==", + "dev": true, + "license": "BSD-3-Clause" + }, + "node_modules/type-fest": { + "version": "0.13.1", + "resolved": "https://registry.npmjs.org/type-fest/-/type-fest-0.13.1.tgz", + "integrity": "sha512-34R7HTnG0XIJcBSn5XhDd7nNFPRcXYRZrBB2O2jdKqYODldSzBAqzsWoZYYvduky73toYS/ESqxPvkDf/F0XMg==", + "dev": true, + "license": "(MIT OR CC0-1.0)", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/web-streams-polyfill": { + "version": "3.3.3", + "resolved": "https://registry.npmjs.org/web-streams-polyfill/-/web-streams-polyfill-3.3.3.tgz", + "integrity": "sha512-d2JWLCivmZYTSIoge9MsgFCZrt571BikcWGYkjC1khllbTeDlGqZ2D8vD8E/lJa8WGWbb7Plm8/XJYV7IJHZZw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 8" + } + } + } +} diff --git a/src/js/package.json b/src/js/package.json new file mode 100644 index 0000000000..5b069fb01f --- /dev/null +++ b/src/js/package.json @@ -0,0 +1,30 @@ +{ + "name": "genai-node", + "type": "module", + "version": "2024.5.0-preview", + "description": "OpenVINO™ GenAI pipelines for using from Node.js environment", + "license": "Apache-2.0", + "main": "./lib/module.js", + "os": [ + "linux", + "darwin", + "win32" + ], + "engines": { + "node": ">=21.0.0" + }, + "keywords": [ + "OpenVINO", + "OpenVINO GenAI", + "GenAI" + ], + "scripts": { + "test_setup": "node ./tests/setup.js", + "test": "npm run test_setup && node --test ./tests/*.test.js" + }, + "devDependencies": { + "node-fetch": "^3.3.2", + "global-agent": "^3.0.0", + "@huggingface/hub": "^0.21.0" + } +} diff --git a/src/js/src/addon.cpp b/src/js/src/addon.cpp new file mode 100644 index 0000000000..4bd1da7bb6 --- /dev/null +++ b/src/js/src/addon.cpp @@ -0,0 +1,30 @@ +#include +#include + +#include "include/addon.hpp" + +#include "include/llm_pipeline/llm_pipeline_wrapper.hpp" + +void init_class(Napi::Env env, + Napi::Object exports, + std::string class_name, + Prototype func, + Napi::FunctionReference& reference) { + const auto& prototype = func(env); + 
+ reference = Napi::Persistent(prototype); + exports.Set(class_name, prototype); +} + +// Define the addon initialization function +Napi::Object init_module(Napi::Env env, Napi::Object exports) { + auto addon_data = new AddonData(); + env.SetInstanceData(addon_data); + + init_class(env, exports, "LLMPipeline", &LLMPipelineWrapper::get_class, addon_data->core); + + return exports; +} + +// Register the addon with Node.js +NODE_API_MODULE(genai-node, init_module) diff --git a/src/js/src/helper.cpp b/src/js/src/helper.cpp new file mode 100644 index 0000000000..106994603b --- /dev/null +++ b/src/js/src/helper.cpp @@ -0,0 +1,53 @@ +#include "include/helper.hpp" + +ov::AnyMap to_anyMap(const Napi::Env& env, const Napi::Value& val) { + ov::AnyMap properties; + if (!val.IsObject()) { + OPENVINO_THROW("Passed Napi::Value must be an object."); + } + const auto& parameters = val.ToObject(); + const auto& keys = parameters.GetPropertyNames(); + + for (uint32_t i = 0; i < keys.Length(); ++i) { + const auto& property_name = static_cast(keys[i]).ToString().Utf8Value(); + + const auto& any_value = js_to_cpp(env, parameters.Get(property_name)); + + properties.insert(std::make_pair(property_name, any_value)); + } + + return properties; +} + +template <> +ov::Any js_to_cpp(const Napi::Env& env, const Napi::Value& value) { + if (value.IsString()) { + return ov::Any(value.ToString().Utf8Value()); + } else if (value.IsBigInt()) { + Napi::BigInt big_value = value.As(); + bool is_lossless; + int64_t big_num = big_value.Int64Value(&is_lossless); + + if (!is_lossless) { + OPENVINO_THROW("Result of BigInt conversion to int64_t results in a loss of precision"); + } + + return ov::Any(big_num); + } else if (value.IsNumber()) { + Napi::Number num = value.ToNumber(); + + if (is_napi_value_int(env, value)) { + return ov::Any(num.Int32Value()); + } else { + return ov::Any(num.DoubleValue()); + } + } else if (value.IsBoolean()) { + return ov::Any(value.ToBoolean()); + } else { + OPENVINO_THROW("Cannot convert to ov::Any"); + } +} + +bool is_napi_value_int(const Napi::Env& env, const Napi::Value& num) { + return env.Global().Get("Number").ToObject().Get("isInteger").As().Call({num}).ToBoolean().Value(); +} diff --git a/src/js/src/llm_pipeline/finish_chat_worker.cpp b/src/js/src/llm_pipeline/finish_chat_worker.cpp new file mode 100644 index 0000000000..b07284688c --- /dev/null +++ b/src/js/src/llm_pipeline/finish_chat_worker.cpp @@ -0,0 +1,14 @@ +#include "include/llm_pipeline/finish_chat_worker.hpp" +#include +#include + +FinishChatWorker::FinishChatWorker(Function& callback, std::shared_ptr& pipe) + : AsyncWorker(callback), pipe(pipe) {}; + +void FinishChatWorker::Execute() { + this->pipe->finish_chat(); +}; + +void FinishChatWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/src/llm_pipeline/init_worker.cpp b/src/js/src/llm_pipeline/init_worker.cpp new file mode 100644 index 0000000000..87dd1aaf34 --- /dev/null +++ b/src/js/src/llm_pipeline/init_worker.cpp @@ -0,0 +1,18 @@ +#include "include/llm_pipeline/init_worker.hpp" +#include +#include + +InitWorker::InitWorker( + Function& callback, + std::shared_ptr& pipe, + const std::string model_path, + const std::string device +) : AsyncWorker(callback), pipe(pipe), model_path(model_path), device(device) {}; + +void InitWorker::Execute() { + this->pipe = std::make_shared(this->model_path, this->device); +}; + +void InitWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp 
b/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp new file mode 100644 index 0000000000..47bc9b352b --- /dev/null +++ b/src/js/src/llm_pipeline/llm_pipeline_wrapper.cpp @@ -0,0 +1,153 @@ +#include "include/helper.hpp" + +#include "include/llm_pipeline/llm_pipeline_wrapper.hpp" +#include "include/llm_pipeline/start_chat_worker.hpp" +#include "include/llm_pipeline/finish_chat_worker.hpp" +#include "include/llm_pipeline/init_worker.hpp" + +struct TsfnContext { + TsfnContext(std::string prompt) : prompt(prompt) {}; + ~TsfnContext() {}; + + std::thread native_thread; + Napi::ThreadSafeFunction tsfn; + + std::string prompt; + std::shared_ptr pipe = nullptr; + std::shared_ptr options = nullptr; +}; + +void performInferenceThread(TsfnContext* context) { + try { + ov::genai::GenerationConfig config; + config.update_generation_config(*context->options); + + std::function streamer = [context](std::string word) { + napi_status status = context->tsfn.BlockingCall([word](Napi::Env env, Napi::Function jsCallback) { + try { + jsCallback.Call({ + Napi::Boolean::New(env, false), + Napi::String::New(env, word) + }); + } catch(std::exception& err) { + Napi::Error::Fatal("performInferenceThread callback error. Details:" , err.what()); + } + }); + if (status != napi_ok) { + // Handle error + Napi::Error::Fatal("performInferenceThread error", "napi_status != napi_ok"); + } + + // Return flag corresponds whether generation should be stopped. + // false means continue generation. + return false; + }; + + context->pipe->generate(context->prompt, config, streamer); + napi_status status = context->tsfn.BlockingCall([](Napi::Env env, Napi::Function jsCallback) { + jsCallback.Call({ + Napi::Boolean::New(env, true), + }); + }); + + if (status != napi_ok) { + // Handle error + Napi::Error::Fatal("performInferenceThread error", "napi_status != napi_ok"); + } + + context->tsfn.Release(); + } + catch(std::exception& e) { + Napi::Error::Fatal("performInferenceThread error" , e.what()); + + context->tsfn.Release(); + } +} + +LLMPipelineWrapper::LLMPipelineWrapper(const Napi::CallbackInfo& info) : Napi::ObjectWrap(info) {}; + +Napi::Function LLMPipelineWrapper::get_class(Napi::Env env) { + return DefineClass(env, + "LLMPipeline", + {InstanceMethod("init", &LLMPipelineWrapper::init), + InstanceMethod("generate", &LLMPipelineWrapper::generate), + InstanceMethod("startChat", &LLMPipelineWrapper::start_chat), + InstanceMethod("finishChat", &LLMPipelineWrapper::finish_chat)}); +} + +Napi::Value LLMPipelineWrapper::init(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + const std::string model_path = info[0].ToString(); + const std::string device = info[1].ToString(); + Napi::Function callback = info[2].As(); + + InitWorker* asyncWorker = new InitWorker(callback, this->pipe, model_path, device); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} + +Napi::Value LLMPipelineWrapper::generate(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + TsfnContext* context = nullptr; + + try { + std::string prompt = info[0].ToString(); + ov::AnyMap options; + if (info.Length() == 3) { + options = to_anyMap(info.Env(), info[2]); + } + + context = new TsfnContext(prompt); + context->pipe = this->pipe; + context->options = std::make_shared(options); + // Create a ThreadSafeFunction + context->tsfn = Napi::ThreadSafeFunction::New( + env, + info[1].As(), // JavaScript function called asynchronously + "TSFN", // Name + 0, // Unlimited queue + 1, // Only one thread will use this initially + [context](Napi::Env) 
{ // Finalizer used to clean threads up + // std::cout << "Finalize TFSN" << std::endl; + context->native_thread.join(); + delete context; + } + ); + context->native_thread = std::thread(performInferenceThread, context); + + return Napi::Boolean::New(env, false); + } catch(Napi::TypeError& type_err) { + throw type_err; + } catch(std::exception& err) { + std::cout << "Catch in the thread: '" << err.what() << "'" << std::endl; + if (context != nullptr) { + context->tsfn.Release(); + } + + throw Napi::Error::New(env, err.what()); + } + + return Napi::Boolean::New(env, true); +} + +Napi::Value LLMPipelineWrapper::start_chat(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + Napi::Function callback = info[0].As(); + + StartChatWorker* asyncWorker = new StartChatWorker(callback, this->pipe); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} + +Napi::Value LLMPipelineWrapper::finish_chat(const Napi::CallbackInfo& info) { + Napi::Env env = info.Env(); + + Napi::Function callback = info[0].As(); + + FinishChatWorker* asyncWorker = new FinishChatWorker(callback, this->pipe); + asyncWorker->Queue(); + + return info.Env().Undefined(); +} diff --git a/src/js/src/llm_pipeline/start_chat_worker.cpp b/src/js/src/llm_pipeline/start_chat_worker.cpp new file mode 100644 index 0000000000..302c505105 --- /dev/null +++ b/src/js/src/llm_pipeline/start_chat_worker.cpp @@ -0,0 +1,14 @@ +#include "include/llm_pipeline/start_chat_worker.hpp" +#include +#include + +StartChatWorker::StartChatWorker(Function& callback, std::shared_ptr& pipe) + : AsyncWorker(callback), pipe(pipe) {}; + +void StartChatWorker::Execute() { + this->pipe->start_chat(); +}; + +void StartChatWorker::OnOK() { + Callback().Call({ Env().Null() }); +}; diff --git a/src/js/tests/bindings.test.js b/src/js/tests/bindings.test.js new file mode 100644 index 0000000000..72ca7f02fc --- /dev/null +++ b/src/js/tests/bindings.test.js @@ -0,0 +1,58 @@ +import addon from '../lib/bindings.cjs'; + +import assert from 'node:assert'; +import { describe, it, before, after } from 'node:test'; +import { models } from './models.js'; + +const MODEL_PATH = process.env.MODEL_PATH + || `./tests/models/${models[0].split('/')[1]}`; + +describe('bindings', () => { + let pipeline = null; + + before((_, done) => { + pipeline = new addon.LLMPipeline(); + + pipeline.init(MODEL_PATH, 'CPU', (err) => { + if (err) { + console.error(err); + process.exit(1); + } + + pipeline.startChat((err) => { + if (err) { + console.error(err); + process.exit(1); + } + + done(); + }); + }); + }); + + after((_, done) => { + pipeline.finishChat((err) => { + if (err) { + console.error(err); + process.exit(1); + } + + done(); + }); + }); + + it('should generate string result', (_, done) => { + let output = ''; + + pipeline.generate('Continue: 1 2 3', (isDone, chunk) => { + if (!isDone) { + output += chunk; + + return; + } + + assert.ok(output.length > 0); + done(); + }, { temperature: '0', max_new_tokens: '4' }); + }); +}); diff --git a/src/js/tests/models.js b/src/js/tests/models.js new file mode 100644 index 0000000000..03c689e038 --- /dev/null +++ b/src/js/tests/models.js @@ -0,0 +1,3 @@ +export const models = [ + 'OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov', +]; diff --git a/src/js/tests/module.test.js b/src/js/tests/module.test.js new file mode 100644 index 0000000000..0825625d3f --- /dev/null +++ b/src/js/tests/module.test.js @@ -0,0 +1,142 @@ +import { Pipeline } from '../lib/module.js'; + +import assert from 'node:assert/strict'; +import { describe, it, 
before, after } from 'node:test'; +import { models } from './models.js'; + +const MODEL_PATH = process.env.MODEL_PATH + || `./tests/models/${models[0].split('/')[1]}`; + +describe('module', async () => { + let pipeline = null; + + await before(async () => { + pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + }); + + await after(async () => { + await pipeline.finishChat(); + }); + + await it('should generate non empty string', async () => { + const result = await pipeline.generate( + 'Type something in English', + { temperature: '0', max_new_tokens: '4' }, + () => {}, + ); + + assert.ok(result.length > 0); + }); +}); + +describe('corner cases', async () => { + it('should throw an error if pipeline is already initialized', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await assert.rejects( + async () => await pipeline.init(), + { + name: 'Error', + message: 'Pipeline is already initialized', + }, + ); + }); + + it('should throw an error if chat is already started', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + + await assert.rejects( + () => pipeline.startChat(), + { + name: 'Error', + message: 'Chat is already started', + }, + ); + }); + + it('should throw an error if chat is not started', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await assert.rejects( + () => pipeline.finishChat(), + { + name: 'Error', + message: 'Chat is not started', + }, + ); + }); +}); + +describe('generation parameters validation', () => { + let pipeline = null; + + before(async () => { + pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + }); + + after(async () => { + await pipeline.finishChat(); + }); + + it('should throw an error if temperature is not a number', async () => { + await assert.rejects( + async () => await pipeline.generate(), + { + name: 'Error', + message: 'Prompt must be a string', + }, + ); + }); + + it('should throw an error if generationCallback is not a function', async () => { + const pipeline = await Pipeline.LLMPipeline(MODEL_PATH, 'CPU'); + + await pipeline.startChat(); + + await assert.rejects( + async () => await pipeline.generate('prompt', {}, false), + { + name: 'Error', + message: 'Generation callback must be a function', + }, + ); + }); + + it('should throw an error if options specified but not an object', async () => { + await assert.rejects( + async () => await pipeline.generate('prompt', 'options', () => {}), + { + name: 'Error', + message: 'Options must be an object', + }, + ); + }); + + it('should perform generation with default options', async () => { + try { + await pipeline.generate('prompt', { max_new_tokens: 1 }); + } catch (error) { + assert.fail(error); + } + + assert.ok(true); + }); + + it('should return a string as generation result', async () => { + const reply = await pipeline.generate('prompt', { max_new_tokens: 1 }); + + assert.strictEqual(typeof reply, 'string'); + }); + + it('should call generationCallback with string chunk', async () => { + await pipeline.generate('prompt', { max_new_tokens: 1 }, (chunk) => { + assert.strictEqual(typeof chunk, 'string'); + }); + }); +}); diff --git a/src/js/tests/setup.js b/src/js/tests/setup.js new file mode 100644 index 0000000000..3b52651719 --- /dev/null +++ b/src/js/tests/setup.js @@ -0,0 +1,6 @@ +import { dowloadModel } from './utils.js'; +import { models } from './models.js'; + +for (const model of 
models) { + await dowloadModel(model); +} diff --git a/src/js/tests/utils.js b/src/js/tests/utils.js new file mode 100644 index 0000000000..504782d8d1 --- /dev/null +++ b/src/js/tests/utils.js @@ -0,0 +1,47 @@ +import { bootstrap } from 'global-agent'; +import { promises as fs } from 'node:fs'; +import { listFiles, downloadFile } from '@huggingface/hub'; + +const BASE_DIR = './tests/models/'; + +bootstrap(); + +export async function dowloadModel(repo) { + console.log(`Downloading model '${repo}'`); + + const fetch = await import('node-fetch'); + const modelName = repo.split('/')[1]; + const destDir = `${BASE_DIR}${modelName}`; + + await fs.mkdir(destDir, { recursive: true }); + + const fileList = await listFiles({ + repo, + fetch: fetch.default, + }); + const fileNames = []; + for await (const file of fileList) { + fileNames.push(file.path); + } + + for (const path of fileNames) { + console.log(`Downloading file '${path}'`); + const response = await downloadFile({ + repo, + path, + fetch: fetch.default, + }); + const filename = `${destDir}/${path}`; + + await saveFile(filename, response); + console.log(`File '${path}' downloaded`); + } + + console.log(`Model '${repo}' downloaded`); +} + +async function saveFile(file, response) { + const arrayBuffer = await response.arrayBuffer(); + + await fs.writeFile(file, Buffer.from(arrayBuffer)); +} diff --git a/src/js/thirdparty/node-lib.def b/src/js/thirdparty/node-lib.def new file mode 100644 index 0000000000..8d46bbec84 --- /dev/null +++ b/src/js/thirdparty/node-lib.def @@ -0,0 +1,147 @@ +NAME NODE.EXE +EXPORTS +napi_async_destroy +napi_async_init +napi_cancel_async_work +napi_create_async_work +napi_create_buffer +napi_create_buffer_copy +napi_create_external_buffer +napi_delete_async_work +napi_fatal_error +napi_get_buffer_info +napi_get_node_version +napi_is_buffer +napi_make_callback +napi_module_register +napi_queue_async_work +napi_adjust_external_memory +napi_call_function +napi_close_escapable_handle_scope +napi_close_handle_scope +napi_coerce_to_bool +napi_coerce_to_number +napi_coerce_to_object +napi_coerce_to_string +napi_create_array +napi_create_array_with_length +napi_create_arraybuffer +napi_create_dataview +napi_create_double +napi_create_error +napi_create_external +napi_create_external_arraybuffer +napi_create_function +napi_create_int32 +napi_create_int64 +napi_create_object +napi_create_promise +napi_create_range_error +napi_create_reference +napi_create_string_latin1 +napi_create_string_utf16 +napi_create_string_utf8 +napi_create_symbol +napi_create_type_error +napi_create_typedarray +napi_create_uint32 +napi_define_class +napi_define_properties +napi_delete_element +napi_delete_property +napi_delete_reference +napi_escape_handle +napi_get_and_clear_last_exception +napi_get_array_length +napi_get_arraybuffer_info +napi_get_boolean +napi_get_cb_info +napi_get_dataview_info +napi_get_element +napi_get_global +napi_get_last_error_info +napi_get_named_property +napi_get_new_target +napi_get_null +napi_get_property +napi_get_property_names +napi_get_prototype +napi_get_reference_value +napi_get_typedarray_info +napi_get_undefined +napi_get_value_bool +napi_get_value_double +napi_get_value_external +napi_get_value_int32 +napi_get_value_int64 +napi_get_value_string_latin1 +napi_get_value_string_utf16 +napi_get_value_string_utf8 +napi_get_value_uint32 +napi_get_version +napi_has_element +napi_has_named_property +napi_has_own_property +napi_has_property +napi_instanceof +napi_is_array +napi_is_arraybuffer +napi_is_dataview 
+napi_is_error +napi_is_exception_pending +napi_is_promise +napi_is_typedarray +napi_new_instance +napi_open_escapable_handle_scope +napi_open_handle_scope +napi_reference_ref +napi_reference_unref +napi_reject_deferred +napi_remove_wrap +napi_resolve_deferred +napi_run_script +napi_set_element +napi_set_named_property +napi_set_property +napi_strict_equals +napi_throw +napi_throw_error +napi_throw_range_error +napi_throw_type_error +napi_typeof +napi_unwrap +napi_wrap +napi_get_uv_event_loop +napi_add_env_cleanup_hook +napi_close_callback_scope +napi_fatal_exception +napi_open_callback_scope +napi_remove_env_cleanup_hook +napi_acquire_threadsafe_function +napi_call_threadsafe_function +napi_create_threadsafe_function +napi_get_threadsafe_function_context +napi_ref_threadsafe_function +napi_release_threadsafe_function +napi_unref_threadsafe_function +napi_add_finalizer +napi_create_date +napi_get_date_value +napi_is_date +napi_create_bigint_int64 +napi_create_bigint_uint64 +napi_create_bigint_words +napi_get_all_property_names +napi_get_instance_data +napi_get_value_bigint_int64 +napi_get_value_bigint_uint64 +napi_get_value_bigint_words +napi_set_instance_data +napi_detach_arraybuffer +napi_is_detached_arraybuffer +napi_add_async_cleanup_hook +napi_remove_async_cleanup_hook +napi_check_object_type_tag +napi_object_freeze +napi_object_seal +napi_type_tag_object diff --git a/src/js/thirdparty/win_delay_load_hook.cc b/src/js/thirdparty/win_delay_load_hook.cc new file mode 100644 index 0000000000..9e652fa4df --- /dev/null +++ b/src/js/thirdparty/win_delay_load_hook.cc @@ -0,0 +1,52 @@ +/* + * When this file is linked to a DLL, it sets up a delay-load hook that + * intervenes when the DLL is trying to load 'node.exe' or 'iojs.exe' + * dynamically. Instead of trying to locate the .exe file it'll just return + * a handle to the process image. + * + * This allows compiled addons to work when node.exe or iojs.exe is renamed. + */ + +#ifdef _MSC_VER + +#ifndef WIN32_LEAN_AND_MEAN +#define WIN32_LEAN_AND_MEAN +#endif + +#include + +#include +#include + +static HMODULE node_dll = NULL; +static HMODULE nw_dll = NULL; + +static FARPROC WINAPI load_exe_hook(unsigned int event, DelayLoadInfo* info) { + if (event == dliNotePreGetProcAddress) { + FARPROC ret = NULL; + ret = GetProcAddress(node_dll, info->dlp.szProcName); + if (ret) + return ret; + ret = GetProcAddress(nw_dll, info->dlp.szProcName); + return ret; + } + if (event == dliStartProcessing) { + node_dll = GetModuleHandleA("node.dll"); + nw_dll = GetModuleHandleA("nw.dll"); + return NULL; + } + if (event != dliNotePreLoadLibrary) + return NULL; + + if (_stricmp(info->szDll, "node.exe") != 0) + return NULL; + + // Fall back to the current process + if(!node_dll) node_dll = GetModuleHandleA(NULL); + + return (FARPROC) node_dll; +} + +decltype(__pfnDliNotifyHook2) __pfnDliNotifyHook2 = load_exe_hook; + +#endif From 5cbadd1603c4019a046bbf46b0dd87feab1e7cbd Mon Sep 17 00:00:00 2001 From: Ilya Lavrenov Date: Wed, 29 Jan 2025 12:42:41 +0400 Subject: [PATCH 03/15] CB: preparation for relying on KV cache precisions from plugins (#1634) - Currently we have logic to detect KV cache precision and this logic become more and more complex - The idea is to rely on plugin's logic and compiled PA model with `ov::element::dynamic` precisions for KV cache inputs. - Later, take `ov::CompiledModel` and extract precisions from its `inputs()` - Then create tensors based on computed `num_kv_blocks` which depends on KV cache precisions. 
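For illustration, a condensed sketch of that extraction step (simplified from the new `CacheManager` constructor added in this patch; `KVCacheInfo` and `read_kv_cache_info` are illustrative names, not symbols introduced by the change, and the per-layer head/block geometry is assumed to be already known from `DeviceConfig`):

```cpp
#include <openvino/openvino.hpp>

#include <string>
#include <vector>

// Per-pipeline KV cache metadata recovered from an already compiled PA model.
struct KVCacheInfo {
    std::vector<ov::element::Type> key_precisions, value_precisions;
    size_t block_size_in_bytes = 0;  // size of one KV block summed over all cache inputs
};

// num_kv_heads, block_size and head_size are assumed to come from DeviceConfig;
// only the precisions are taken from the plugin here.
KVCacheInfo read_kv_cache_info(const ov::CompiledModel& compiled_model,
                               size_t num_kv_heads, size_t block_size, size_t head_size) {
    KVCacheInfo info;
    for (const auto& input : compiled_model.inputs()) {
        for (const auto& name : input.get_names()) {
            // The plugin has already materialized a concrete precision for the
            // ov::element::dynamic cache inputs the model was compiled with.
            const ov::element::Type precision = input.get_element_type();
            if (name.find("key_cache.") == 0) {
                info.key_precisions.push_back(precision);
            } else if (name.find("value_cache.") == 0) {
                info.value_precisions.push_back(precision);
            } else {
                continue;  // not a cache input name, check the next name
            }
            info.block_size_in_bytes += num_kv_heads * block_size * head_size * precision.size();
            break;  // one matching name per cache input is enough
        }
    }
    return info;
}

// The scheduler config can then be normalized from GB to blocks, e.g.:
//   num_kv_blocks = cache_size_gb * 1024 * 1024 * 1024 / info.block_size_in_bytes;
```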
Currently, logic to mimic plugin's logic for KV cache precisions is still here, but will be dropped once plugin will support `ov::element::dynamic` --- .github/labeler.yml | 4 +- src/cpp/src/cache_manager.hpp | 169 ++++++++++++------ src/cpp/src/continuous_batching_impl.cpp | 127 +++++++++++-- src/cpp/src/continuous_batching_impl.hpp | 4 +- src/cpp/src/device_config.hpp | 115 +----------- .../paged_attention_transformations.cpp | 24 ++- .../paged_attention_transformations.hpp | 0 src/cpp/src/scheduler.hpp | 30 ++-- ...batching_for_speculative_decoding_impl.cpp | 3 +- ...batching_for_speculative_decoding_impl.hpp | 3 +- .../speculative_decoding_impl.cpp | 11 +- tests/cpp/CMakeLists.txt | 6 +- tests/cpp/cache_manager.cpp | 91 ++++------ tests/cpp/device_config.cpp | 33 ---- tests/cpp/helper.cpp | 27 +++ tests/cpp/helper.hpp | 8 + tests/cpp/scheduler.cpp | 34 +--- tests/cpp/speculative_decoding.cpp | 3 +- 18 files changed, 352 insertions(+), 340 deletions(-) rename src/cpp/src/{utils => }/paged_attention_transformations.cpp (80%) rename src/cpp/src/{utils => }/paged_attention_transformations.hpp (100%) delete mode 100644 tests/cpp/device_config.cpp create mode 100644 tests/cpp/helper.cpp create mode 100644 tests/cpp/helper.hpp diff --git a/.github/labeler.yml b/.github/labeler.yml index 2bfe4248c1..a75abd795c 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -103,8 +103,8 @@ - 'src/cpp/src/generation_handle.cpp' - 'src/cpp/src/generation_stream.hpp' - 'src/cpp/src/model_runner.hpp' -- 'src/cpp/src/utils/paged_attention_transformations.cpp' -- 'src/cpp/src/utils/paged_attention_transformations.hpp' +- 'src/cpp/src/paged_attention_transformations.cpp' +- 'src/cpp/src/paged_attention_transformations.hpp' - 'src/cpp/src/scheduler.hpp' - 'src/cpp/src/sequence_group.cpp' - 'src/cpp/src/sequence_group.hpp' diff --git a/src/cpp/src/cache_manager.hpp b/src/cpp/src/cache_manager.hpp index 5a0ff9b9f3..255bb926be 100644 --- a/src/cpp/src/cache_manager.hpp +++ b/src/cpp/src/cache_manager.hpp @@ -45,19 +45,19 @@ class TensorMmapAllocator { #endif namespace ov::genai { + class CacheManager { - DeviceConfig m_device_config; - std::vector m_key_cache; - std::vector m_value_cache; - size_t m_num_allocated_kv_blocks = 0; + size_t m_num_decoder_layers = 0; + std::string m_device; + std::vector m_key_precisions, m_value_precisions; + std::vector m_key_shapes, m_value_shapes; + std::vector m_key_cache, m_value_cache; + size_t m_num_allocated_kv_blocks = 0, m_block_size_in_bytes = 0; ov::InferRequest m_request; - ov::Core m_core; - ov::Shape set_first_dim_and_make_static(const ov::PartialShape& shape, size_t dim) { - ov::PartialShape res_shape = shape; - res_shape[0] = dim; - OPENVINO_ASSERT(res_shape.is_static()); - return res_shape.to_shape(); + static ov::Shape set_kv_blocks(ov::PartialShape pshape, size_t num_kv_blocks) { + pshape[0] = num_kv_blocks; + return pshape.get_shape(); } void update_request_tensor(size_t decoder_layer_id) { @@ -65,41 +65,106 @@ class CacheManager { m_request.set_tensor(std::string("value_cache.") + std::to_string(decoder_layer_id), m_value_cache[decoder_layer_id]); } + ov::PartialShape patch_shape(ov::PartialShape pshape, ov::element::Type cache_type) { + OPENVINO_ASSERT(!m_device.empty(), "Internal error: device is not set"); + + if (m_device.find("CPU") != std::string::npos && cache_type == ov::element::u8) { + // Scale, zero point and quantized data will be stored together. 
+ // The layout for per token per head: + // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| + // so, we have to extend head_size by 8, which is sizeof(float) + // for scale and sizeof(float) for zeropoint + pshape[3] += 2 * sizeof(float); + } + + return pshape; + } + public: - explicit CacheManager(const DeviceConfig &device_config, ov::InferRequest request, ov::Core core) : - m_device_config(device_config), - m_request(request), - m_core(core) { - m_key_cache.reserve(m_device_config.get_num_layers()); - m_value_cache.reserve(m_device_config.get_num_layers()); + CacheManager(ov::InferRequest request, const DeviceConfig& device_config) : + m_request(request) { + // extract information about inference device + ov::CompiledModel compiled_model = request.get_compiled_model(); + std::vector execution_devices = compiled_model.get_property(ov::execution_devices); + OPENVINO_ASSERT(execution_devices.size() == 1, "Contituous batching: execution device is expected to be CPU or GPU, but got ", execution_devices.size(), " devices"); + m_device = execution_devices[0]; + + // extract information about KV cache precisions and shapes + size_t kv_input_index = 0; + for (const auto& input : compiled_model.inputs()) { + for (auto & name : input.get_names()) { + auto cache_precision = input.get_element_type(); + + if (name.find("key_cache.") == 0) { + auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision); + m_key_shapes.push_back(pshape); + m_key_precisions.push_back(cache_precision); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); + break; + } else if (name.find("value_cache.") == 0) { + auto pshape = patch_shape(device_config.get_value_cache_shape(kv_input_index), cache_precision); + m_value_shapes.push_back(pshape); + m_value_precisions.push_back(cache_precision); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); + ++kv_input_index; + break; + } + } + } + + m_num_decoder_layers = m_value_precisions.size(); + OPENVINO_ASSERT(m_num_decoder_layers == m_key_precisions.size(), "Invalid case: a different number of K and V caches in a LLM model"); + } + + size_t get_num_decoder_layers() const { + return m_num_decoder_layers; + } + + std::string get_device() const { + return m_device; + } + + ov::element::Type get_key_cache_precision(size_t decoder_layer_id) const { + OPENVINO_ASSERT(decoder_layer_id < m_key_precisions.size()); + return m_key_precisions[decoder_layer_id]; + } + + ov::element::Type get_value_cache_precision(size_t decoder_layer_id) const { + OPENVINO_ASSERT(decoder_layer_id < m_value_precisions.size()); + return m_value_precisions[decoder_layer_id]; + } + + size_t get_block_size_in_bytes() const { + return m_block_size_in_bytes; } void allocate_cache_if_needed(size_t num_kv_blocks) { if (m_num_allocated_kv_blocks >= num_kv_blocks) { return; } - OPENVINO_ASSERT(m_key_cache.size() == m_value_cache.size()); - m_num_allocated_kv_blocks = num_kv_blocks; - const std::string device_name = m_device_config.get_device(); + m_num_allocated_kv_blocks = num_kv_blocks; ov::Coordinate start_key{0,0,0,0}; ov::Coordinate start_value{0,0,0,0}; - if (device_name.find("GPU") == std::string::npos) {// Allocate KV caches - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape value_cache_shape = 
set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Shape key_cache_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), num_kv_blocks); + if (m_device.find("GPU") == std::string::npos) {// Allocate KV caches + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape value_cache_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], num_kv_blocks); + ov::Shape key_cache_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], num_kv_blocks); + + ov::element::Type key_precision = get_key_cache_precision(decoder_layer_id); + ov::element::Type value_precision = get_value_cache_precision(decoder_layer_id); + #ifdef _WIN32 - ov::Tensor key_cache(m_device_config.get_cache_precision(), key_cache_shape); - ov::Tensor value_cache(m_device_config.get_cache_precision(), value_cache_shape); + ov::Tensor key_cache(key_precision, key_cache_shape); + ov::Tensor value_cache(value_precision, value_cache_shape); #else - auto key_size = ov::shape_size(key_cache_shape) * m_device_config.get_cache_precision().size(); - auto value_size = ov::shape_size(value_cache_shape) * m_device_config.get_cache_precision().size(); - - ov::Tensor key_cache = ov::Tensor(m_device_config.get_cache_precision(), key_cache_shape, TensorMmapAllocator(key_size)); - ov::Tensor value_cache = ov::Tensor(m_device_config.get_cache_precision(), value_cache_shape, TensorMmapAllocator(value_size)); + auto key_size = ov::shape_size(key_cache_shape) * key_precision.size(); + auto value_size = ov::shape_size(value_cache_shape) * value_precision.size(); + ov::Tensor key_cache(key_precision, key_cache_shape, TensorMmapAllocator(key_size)); + ov::Tensor value_cache(value_precision, value_cache_shape, TensorMmapAllocator(value_size)); #endif auto key_cache_roi_end = static_cast(key_cache.data()); @@ -137,8 +202,7 @@ class CacheManager { if (m_key_cache.size() > decoder_layer_id) { m_key_cache[decoder_layer_id] = key_cache; m_value_cache[decoder_layer_id] = value_cache; - } - else { + } else { m_key_cache.emplace_back(key_cache); m_value_cache.emplace_back(value_cache); } @@ -146,15 +210,15 @@ class CacheManager { update_request_tensor(decoder_layer_id); } } else { - auto remote_context = m_core.get_default_context(device_name); - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape value_cache_shape = set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Shape key_cache_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), num_kv_blocks); - ov::Tensor key_cache = remote_context.create_tensor(m_device_config.get_cache_precision(), - key_cache_shape); - ov::Tensor value_cache = remote_context.create_tensor(m_device_config.get_cache_precision(), - value_cache_shape); - + auto remote_context = m_request.get_compiled_model().get_context(); + + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape value_cache_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], num_kv_blocks); + ov::Shape key_cache_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], num_kv_blocks); + + ov::Tensor key_cache = remote_context.create_tensor(get_key_cache_precision(decoder_layer_id), key_cache_shape); + ov::Tensor value_cache = remote_context.create_tensor(get_value_cache_precision(decoder_layer_id), 
value_cache_shape); + if (m_key_cache.size() > decoder_layer_id) { ov::Coordinate end_key = m_key_cache[decoder_layer_id].get_shape(); ov::Coordinate end_value = m_value_cache[decoder_layer_id].get_shape(); @@ -167,23 +231,23 @@ class CacheManager { m_key_cache[decoder_layer_id] = key_cache; m_value_cache[decoder_layer_id] = value_cache; - } - else { + } else { m_key_cache.emplace_back(key_cache); m_value_cache.emplace_back(value_cache); } + update_request_tensor(decoder_layer_id); } } } ov::Tensor get_key_cache(size_t decoder_layer_id) const { - OPENVINO_ASSERT(decoder_layer_id < m_key_cache.size()); + OPENVINO_ASSERT(decoder_layer_id < m_key_cache.size(), "decoder_layer_id = ", decoder_layer_id, ", num_layers = ", m_key_cache.size()); return m_key_cache[decoder_layer_id]; } ov::Tensor get_value_cache(size_t decoder_layer_id) const { - OPENVINO_ASSERT(decoder_layer_id < m_value_cache.size()); + OPENVINO_ASSERT(decoder_layer_id < m_value_cache.size(), "decoder_layer_id = ", decoder_layer_id, ", num_layers = ", m_value_cache.size()); return m_value_cache[decoder_layer_id]; } @@ -192,9 +256,9 @@ class CacheManager { size_t src_block_id = blocks_pair.first; const std::list& dst_block_ids = blocks_pair.second; for (size_t dst_block_id : dst_block_ids) { - for (size_t decoder_layer_id = 0; decoder_layer_id < m_device_config.get_num_layers(); ++decoder_layer_id) { - ov::Shape key_shape = set_first_dim_and_make_static(m_device_config.get_key_cache_shape(decoder_layer_id), m_num_allocated_kv_blocks); - ov::Shape value_shape = set_first_dim_and_make_static(m_device_config.get_value_cache_shape(decoder_layer_id), m_num_allocated_kv_blocks); + for (size_t decoder_layer_id = 0; decoder_layer_id < m_num_decoder_layers; ++decoder_layer_id) { + ov::Shape key_shape = set_kv_blocks(m_key_shapes[decoder_layer_id], m_num_allocated_kv_blocks); + ov::Shape value_shape = set_kv_blocks(m_value_shapes[decoder_layer_id], m_num_allocated_kv_blocks); ov::Coordinate key_src_start_roi(key_shape.size(), 0); ov::Coordinate key_src_end_roi = key_shape; ov::Coordinate key_dst_start_roi(key_shape.size(), 0); @@ -221,13 +285,6 @@ class CacheManager { } } } - - std::shared_ptr get_core() { - return std::make_shared(m_core); - } - - std::shared_ptr get_device_config() { - return std::make_shared(m_device_config); - } }; + } diff --git a/src/cpp/src/continuous_batching_impl.cpp b/src/cpp/src/continuous_batching_impl.cpp index 99df043090..b4100f8aec 100644 --- a/src/cpp/src/continuous_batching_impl.cpp +++ b/src/cpp/src/continuous_batching_impl.cpp @@ -7,9 +7,95 @@ #include "text_callback_streamer.hpp" #include "continuous_batching_impl.hpp" #include "utils.hpp" -#include "utils/paged_attention_transformations.hpp" +#include "paged_attention_transformations.hpp" #include "lora_helper.hpp" #include "cache_state_dumper.hpp" +#include "utils.hpp" + +namespace { + +ov::element::Type get_model_kv_cache_precision(std::shared_ptr model) { + const std::vector kv_cache_precision_path = { "runtime_options", ov::hint::kv_cache_precision.name() }; + ov::element::Type ir_kv_cache_precision = ov::element::undefined; + + if (model->has_rt_info(kv_cache_precision_path)) { + ir_kv_cache_precision = model->get_rt_info(kv_cache_precision_path); + } + + return ir_kv_cache_precision; +} + +void apply_kv_cache_precision(const std::shared_ptr& model, const std::string& device, const ov::AnyMap& plugin_config) { + ov::element::Type m_kv_cache_type = ov::element::undefined, ir_kv_cache_precision = get_model_kv_cache_precision(model); + ov::Core core = 
ov::genai::utils::singleton_core(); + + auto inference_precision = core.get_property(device, ov::hint::inference_precision); + // if user sets properties affecting KV cache precision + const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); + const auto kv_cache_precision_it = plugin_config.find(ov::hint::kv_cache_precision.name()); + const auto execution_mode_it = plugin_config.find(ov::hint::execution_mode.name()); + const bool accuracy_mode = execution_mode_it != plugin_config.end() && + execution_mode_it->second.as() == ov::hint::ExecutionMode::ACCURACY; + + if (device == "CPU") { + if (kv_cache_precision_it != plugin_config.end()) { + const auto kv_cache_precision = kv_cache_precision_it->second.as(); + m_kv_cache_type = kv_cache_precision; + } else if (accuracy_mode) { + // ACCURACY mode will use f32 KV cache type + m_kv_cache_type = ov::element::f32; + } else if (ir_kv_cache_precision != ov::element::undefined) { + // check that kv_cache_precision is set in runtime_info section of OpenVINO IR + // but in case it's set to FP16, we need to patch it to be BF16 for Xeon platforms + m_kv_cache_type = ir_kv_cache_precision == ov::element::f16 && inference_precision == ov::element::bf16 ? + inference_precision : ir_kv_cache_precision; + } else { + // x86 and ARM have different default kv cache type, take this information from the plugin + m_kv_cache_type = core.get_property(device, ov::hint::kv_cache_precision); + } + + // TEMP WA: currently FP16 / BF16 KV cache is faster than U8 for PagedAttention + if (m_kv_cache_type == ov::element::u8) { + m_kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; + } + } else if (device.find("GPU") != std::string::npos) { + if (accuracy_mode) { + inference_precision = ov::element::f32; + } + if (inference_precision_it != plugin_config.end()) { + inference_precision = inference_precision_it->second.as(); + } + + m_kv_cache_type = inference_precision; + } else { + OPENVINO_THROW(device, " is not supported by OpenVINO Continuous Batching"); + } + + std::map> key_cache_params, value_cache_params; + for (const auto& param_ptr : model->get_parameters()) { + const auto& name = param_ptr->get_friendly_name(); + if (name.find("key_cache.") == 0) { + key_cache_params[name] = param_ptr; + } else if (name.find("value_cache.") == 0) { + value_cache_params[name] = param_ptr; + } + } + + OPENVINO_ASSERT(key_cache_params.size() == value_cache_params.size() && key_cache_params.size() > 0); + + size_t num_decoder_layers = key_cache_params.size(); + for (size_t idx = 0; idx < num_decoder_layers; idx++) { + auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; + auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; + + k->set_element_type(m_kv_cache_type); + v->set_element_type(m_kv_cache_type); + } + + model->validate_nodes_and_infer_types(); +} + +} // namespace namespace ov::genai { template struct overloaded : Ts... 
{using Ts::operator()...;}; @@ -27,15 +113,14 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::ContinuousBatchingImpl( m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - ov::Core core = utils::singleton_core(); - DeviceConfig device_config(core, scheduler_config, device, properties); + DeviceConfig device_config(device); bool is_need_per_layer_cache_control = scheduler_config.use_cache_eviction; bool allow_cache_rotation = scheduler_config.cache_eviction_config.apply_rotation; utils::apply_paged_attention_transformations(model, device_config, is_need_per_layer_cache_control, allow_cache_rotation); utils::apply_gather_before_matmul_transformation(model); - initialize_pipeline(model, scheduler_config, properties, device_config, core); + initialize_pipeline(model, scheduler_config, properties, device_config); } ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() { @@ -55,10 +140,13 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( std::shared_ptr model, const SchedulerConfig& scheduler_config, const ov::AnyMap& properties, - const DeviceConfig& device_config, - ov::Core& core) { + const DeviceConfig& device_config) { + ov::Core core = utils::singleton_core(); ov::CompiledModel compiled_model; + // TODO: remove once plugin automatically set KV cache precisions + apply_kv_cache_precision(model, device_config.get_device(), properties); + // apply LoRA if (auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters)) { m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model."); @@ -71,24 +159,27 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( ov::genai::utils::print_compiled_model_properties(compiled_model, "LLM with Paged Attention"); ov::InferRequest infer_request = compiled_model.create_infer_request(); - m_num_decoder_layers = device_config.get_num_layers(); - - // setup KV caches - std::shared_ptr cache_manager = std::make_shared(device_config, infer_request, core); + // Cache manager + std::shared_ptr cache_manager = std::make_shared(infer_request, device_config); + m_num_decoder_layers = cache_manager->get_num_decoder_layers(); - SchedulerConfig updated_config = scheduler_config; - // update KV blocks number in scheduler config - if (scheduler_config.num_kv_blocks != device_config.get_num_kv_blocks()) { - updated_config.num_kv_blocks = device_config.get_num_kv_blocks(); + // Scheduler + SchedulerConfig normalized_config = scheduler_config; + if (normalized_config.num_kv_blocks == 0 && normalized_config.cache_size > 0) { + size_t size_in_bytes = normalized_config.cache_size * 1024 * 1024 * 1024; // convert GBs to bytes + normalized_config.num_kv_blocks = size_in_bytes / cache_manager->get_block_size_in_bytes(); } bool can_use_partial_preemption = true; - if (device_config.get_device().find("GPU") != std::string::npos && !updated_config.dynamic_split_fuse) { + if (device_config.get_device().find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { // in case of executing a `vLLM-like` pipeline, it's better not to use partial eviction on the GPU, // as it may lead to performance slowdown can_use_partial_preemption = false; } - m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, updated_config, device_config.get_num_layers(), can_use_partial_preemption); + + m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, 
normalized_config, m_num_decoder_layers, can_use_partial_preemption); + + // Model Runner bool is_use_cache_eviction = m_scheduler->get_config().use_cache_eviction; if (is_use_cache_eviction) { const auto& eviction_config = m_scheduler->get_config().cache_eviction_config; @@ -101,14 +192,14 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( /* is_use_rotation_inputs = */ is_apply_rotation); if (eviction_config.apply_rotation) { m_rotation_deltas_stores.reserve(m_num_decoder_layers); - ov::Shape rotation_deltas_store_shape{scheduler_config.num_kv_blocks, 1}; // last dim can be later changed to BLOCK_SIZE for per-token granularity + ov::Shape rotation_deltas_store_shape{normalized_config.num_kv_blocks, 1}; // last dim can be later changed to BLOCK_SIZE for per-token granularity for (size_t i = 0; i < m_num_decoder_layers; i++) { ov::Tensor store(ov::element::i32, rotation_deltas_store_shape); std::memset(store.data(), 0, store.get_byte_size()); m_rotation_deltas_stores.push_back(store); } - size_t max_sequence_cache_occupation_length_in_blocks = scheduler_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; + size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; size_t embedding_size = device_config.get_k_head_size(0); m_cache_rotation_calculator = std::make_shared( m_scheduler->get_block_size(), diff --git a/src/cpp/src/continuous_batching_impl.hpp b/src/cpp/src/continuous_batching_impl.hpp index f64657bc7a..9fa6c9c660 100644 --- a/src/cpp/src/continuous_batching_impl.hpp +++ b/src/cpp/src/continuous_batching_impl.hpp @@ -59,9 +59,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc void initialize_pipeline(std::shared_ptr model, const SchedulerConfig& scheduler_config, const ov::AnyMap& plugin_config, - const DeviceConfig& device_config, - ov::Core& core); - + const DeviceConfig& device_config); /** * Pulls requests from awaiting queue to running queue diff --git a/src/cpp/src/device_config.hpp b/src/cpp/src/device_config.hpp index 3d41960c5e..09020da9a8 100644 --- a/src/cpp/src/device_config.hpp +++ b/src/cpp/src/device_config.hpp @@ -20,11 +20,8 @@ struct KVHeadConfig { }; class DeviceConfig { - ov::element::Type m_kv_cache_type; std::vector m_key_cache_shape, m_value_cache_shape; std::vector m_kv_heads_config; - size_t m_num_decoder_layers = 0; - size_t m_num_kv_blocks = 0, m_cache_size = 0; // KV cache sizes in either blocks or GBs size_t m_block_size = 0; // block size is per inference device std::string m_device; @@ -35,90 +32,17 @@ class DeviceConfig { } public: - DeviceConfig(ov::Core& core, const SchedulerConfig& scheduling_config, const std::string& device, const ov::AnyMap& plugin_config = {}) { + explicit DeviceConfig(const std::string& device) { m_device = device; - - // keep information about blocsk m_block_size = get_block_size_by_device(device); - - if (m_device == "CPU") { - auto inference_precision = core.get_property(device, ov::hint::inference_precision); - m_kv_cache_type = inference_precision == ov::element::bf16 ? 
ov::element::bf16 : ov::element::f16; - - // if user sets precision hint, kv cache type should be changed - const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); - if (inference_precision_it != plugin_config.end()) { - const auto inference_precision = inference_precision_it->second.as(); - if (inference_precision == ov::element::f32) { - m_kv_cache_type = ov::element::f32; - } else if (inference_precision == ov::element::f16) { - m_kv_cache_type = ov::element::f16; - } else if (inference_precision == ov::element::bf16) { - m_kv_cache_type = ov::element::bf16; - } else { - // use default f32 - m_kv_cache_type = ov::element::f32; - } - } - - // if user sets ov::kv_cache_precision hint - const auto kv_cache_precision_it = plugin_config.find(ov::hint::kv_cache_precision.name()); - if (kv_cache_precision_it != plugin_config.end()) { - const auto kv_cache_precision = kv_cache_precision_it->second.as(); - m_kv_cache_type = kv_cache_precision; - } - } else if (m_device.find("GPU") != std::string::npos) { - auto inference_precision = core.get_property(device, ov::hint::inference_precision); - m_kv_cache_type = inference_precision == ov::element::f16 ? ov::element::f16 : ov::element::f32; - - // if user sets precision hint, kv cache type should be changed - const auto inference_precision_it = plugin_config.find(ov::hint::inference_precision.name()); - if (inference_precision_it != plugin_config.end()) { - const auto inference_precision = inference_precision_it->second.as(); - if (inference_precision == ov::element::f16) { - m_kv_cache_type = ov::element::f16; - } else { - // use default f32 - m_kv_cache_type = ov::element::f32; - } - } - } else { - OPENVINO_THROW(m_device, " is not supported by OpenVINO Continuous Batching"); - } - - if (scheduling_config.num_kv_blocks > 0) { - m_num_kv_blocks = scheduling_config.num_kv_blocks; - } else if (scheduling_config.cache_size > 0) { - m_cache_size = scheduling_config.cache_size; - } } - void set_kv_head_configs(std::vector kv_heads_config) { + void set_kv_head_configs(const std::vector& kv_heads_config) { m_kv_heads_config = kv_heads_config; - m_num_decoder_layers = m_kv_heads_config.size(); - m_key_cache_shape.reserve(m_num_decoder_layers); - m_value_cache_shape.reserve(m_num_decoder_layers); - - if (m_device == "CPU") { - // Scale, zero point and quantized data will be stored together. 
- // The layout for per token per head: - // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| - // so, we have to extend head_size by 8, which is sizeof(float) - // for scale and sizeof(float) for zeropoint - if (m_kv_cache_type == ov::element::u8) { - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; ++layer_id) { - m_kv_heads_config[layer_id].k_head_size += 8; - m_kv_heads_config[layer_id].v_head_size += 8; - } - } - } + m_key_cache_shape.reserve(m_kv_heads_config.size()); + m_value_cache_shape.reserve(m_kv_heads_config.size()); - if (m_num_kv_blocks == 0 && m_cache_size > 0) { - size_t size_in_bytes = m_cache_size * 1024 * 1024 * 1024; // convert GBs to bytes - m_num_kv_blocks = size_in_bytes / get_block_size_in_bytes(); - } - - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; layer_id++) { + for (size_t layer_id = 0; layer_id < kv_heads_config.size(); layer_id++) { const KVHeadConfig& config = m_kv_heads_config[layer_id]; m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), @@ -126,7 +50,7 @@ class DeviceConfig { ov::Dimension(m_block_size), ov::Dimension(config.v_head_size)}); - if (m_device.find("GPU") == std::string::npos) { + if (m_device.find("CPU") != std::string::npos) { m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), ov::Dimension(config.num_k_heads), ov::Dimension(m_block_size), @@ -145,44 +69,23 @@ class DeviceConfig { return m_device; } - ov::element::Type get_cache_precision() const { - return m_kv_cache_type; - } - - size_t get_num_layers() const { - return m_num_decoder_layers; - } - ov::PartialShape get_key_cache_shape(size_t id) const { OPENVINO_ASSERT(m_key_cache_shape.size()); return m_key_cache_shape[id]; } - size_t get_k_head_size(size_t layer_id) const { - return m_kv_heads_config[layer_id].k_head_size; - } - ov::PartialShape get_value_cache_shape(size_t id) const { OPENVINO_ASSERT(m_value_cache_shape.size()); return m_value_cache_shape[id]; } - size_t get_num_kv_blocks() const { - return m_num_kv_blocks; + size_t get_k_head_size(size_t layer_id) const { + return m_kv_heads_config[layer_id].k_head_size; } size_t get_block_size() const { return m_block_size; } - - size_t get_block_size_in_bytes() const { - size_t block_size_in_bytes = 0; - for (size_t layer_id = 0; layer_id < m_num_decoder_layers; layer_id++) { - const KVHeadConfig& config = m_kv_heads_config[layer_id]; - block_size_in_bytes += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; - } - block_size_in_bytes *= get_block_size() * get_cache_precision().size(); - return block_size_in_bytes; - } }; + } diff --git a/src/cpp/src/utils/paged_attention_transformations.cpp b/src/cpp/src/paged_attention_transformations.cpp similarity index 80% rename from src/cpp/src/utils/paged_attention_transformations.cpp rename to src/cpp/src/paged_attention_transformations.cpp index 17a3fdddbe..6d337136dc 100644 --- a/src/cpp/src/utils/paged_attention_transformations.cpp +++ b/src/cpp/src/paged_attention_transformations.cpp @@ -1,7 +1,7 @@ // Copyright (C) 2023-2025 Intel Corporation // SPDX-License-Identifier: Apache-2.0 -#include "utils/paged_attention_transformations.hpp" +#include "paged_attention_transformations.hpp" #include "openvino/pass/manager.hpp" #include "openvino/pass/sdpa_to_paged_attention.hpp" @@ -10,7 +10,6 @@ namespace ov { namespace genai { namespace utils { - size_t get_hidden_size(const std::shared_ptr model) { const auto& parameters = 
model->get_parameters(); // extract num_kv_heads and head_size @@ -50,23 +49,32 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& for (size_t idx = 0; idx < num_decoder_layers; idx++) { KVHeadConfig& config = kv_heads_config[idx]; - auto key_shape = key_cache_params[std::string("key_cache.") + std::to_string(idx)]->get_partial_shape(); + auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; + auto key_shape = k->get_partial_shape(); config.num_k_heads = key_shape[1].get_length(); config.k_head_size = key_shape[2].get_length(); - auto value_shape = value_cache_params[std::string("value_cache.") + std::to_string(idx)]->get_partial_shape(); + auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; + auto value_shape = v->get_partial_shape(); config.num_v_heads = value_shape[1].get_length(); config.v_head_size = value_shape[2].get_length(); } + + // save information about KV caches in device_config + // and create device dependent KV cache shapes device_config.set_kv_head_configs(kv_heads_config); for (size_t idx = 0; idx < num_decoder_layers; idx++) { auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; - k->set_element_type(device_config.get_cache_precision()); - v->set_element_type(device_config.get_cache_precision()); - k->set_partial_shape(device_config.get_key_cache_shape(idx)); - v->set_partial_shape(device_config.get_value_cache_shape(idx)); + + // allow a plugin to automatically set KV cache precisions + k->set_element_type(ov::element::dynamic); + v->set_element_type(ov::element::dynamic); + + // set device specific KV cache shapes back to a PA model + k->set_partial_shape(ov::PartialShape::dynamic(4)); + v->set_partial_shape(ov::PartialShape::dynamic(4)); } model->validate_nodes_and_infer_types(); diff --git a/src/cpp/src/utils/paged_attention_transformations.hpp b/src/cpp/src/paged_attention_transformations.hpp similarity index 100% rename from src/cpp/src/utils/paged_attention_transformations.hpp rename to src/cpp/src/paged_attention_transformations.hpp diff --git a/src/cpp/src/scheduler.hpp b/src/cpp/src/scheduler.hpp index ba6fe44cff..23db68deab 100644 --- a/src/cpp/src/scheduler.hpp +++ b/src/cpp/src/scheduler.hpp @@ -14,6 +14,7 @@ #include "sequence_group.hpp" #include "cache_manager.hpp" #include "timer.hpp" +#include "utils.hpp" namespace ov::genai { class Scheduler { @@ -462,12 +463,12 @@ class Scheduler { } size_t _get_available_gpu_memory() { - auto device_config = m_cache_manager->get_device_config(); - auto core = m_cache_manager->get_core(); - auto device = device_config->get_device(); + auto device = m_cache_manager->get_device(); OPENVINO_ASSERT(device.find("GPU") != std::string::npos, "_get_available_gpu_memory() is applicable for GPU only."); - auto memory_statistics = core->get_property(device, ov::intel_gpu::memory_statistics); - auto device_type = core->get_property(device, ov::device::type); + + ov::Core core = utils::singleton_core(); + auto memory_statistics = core.get_property(device, ov::intel_gpu::memory_statistics); + auto device_type = core.get_property(device, ov::device::type); // sum up all used device memory std::vector device_memory_types = {"cl_mem", "usm_device"}; @@ -487,7 +488,7 @@ class Scheduler { used_device_mem *= used_memory_threshold; // total device memory in bytes - auto total_device_memory = core->get_property(device, ov::intel_gpu::device_total_mem_size); + auto 
total_device_memory = core.get_property(device, ov::intel_gpu::device_total_mem_size); return total_device_memory - used_device_mem; } @@ -514,32 +515,29 @@ class Scheduler { if (!m_dynamic_memory_allocation) { return false; } - auto device_config = m_cache_manager->get_device_config(); - auto device = device_config->get_device(); + auto device = m_cache_manager->get_device(); size_t current_num_of_kv_blocks = m_block_manager->get_total_number_of_kv_blocks(); size_t new_blocks_num = current_num_of_kv_blocks * m_cache_growth_factor; if (device.find("GPU") == std::string::npos) { m_block_manager->increase_kv_blocks_number(new_blocks_num); - } - else { - size_t available_gpu_memory = _get_available_gpu_memory(); - size_t required_memory = (new_blocks_num - current_num_of_kv_blocks) * device_config->get_block_size_in_bytes(); + } else { + const size_t available_gpu_memory = _get_available_gpu_memory(); + const size_t block_size_in_bytes = m_cache_manager->get_block_size_in_bytes(); + size_t required_memory = (new_blocks_num - current_num_of_kv_blocks) * block_size_in_bytes; if (required_memory <= available_gpu_memory) { m_block_manager->increase_kv_blocks_number(new_blocks_num); } else { - size_t possible_blocks_to_add = available_gpu_memory / device_config->get_block_size_in_bytes(); + size_t possible_blocks_to_add = available_gpu_memory / block_size_in_bytes; if (possible_blocks_to_add > 0) { m_block_manager->increase_kv_blocks_number(current_num_of_kv_blocks + possible_blocks_to_add); - } - else { + } else { return false; } } } return true; } - }; } diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp index bec2b75e0d..2ecdbd66f3 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp @@ -5,7 +5,6 @@ namespace ov::genai { ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::ContinuousBatchingForSpeculativeDecodingImpl( - ov::Core& core, const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, @@ -17,7 +16,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin m_tokenizer = tokenizer; m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - initialize_pipeline(model, scheduler_config, plugin_config, device_config, core); + initialize_pipeline(model, scheduler_config, plugin_config, device_config); } void diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp index e4e4be63d8..b714316e75 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp @@ -13,8 +13,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl : public: ContinuousBatchingForSpeculativeDecodingImpl() = default; - ContinuousBatchingForSpeculativeDecodingImpl(ov::Core& core, - const std::shared_ptr& model, + ContinuousBatchingForSpeculativeDecodingImpl(const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, const DeviceConfig& device_config, diff --git 
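For the GPU path in the scheduler change above, the growth decision reduces to simple arithmetic: the available memory is the device total minus the currently used `cl_mem`/`usm_device` allocations scaled by the usage threshold, and the block pool grows only as far as that budget allows. The following Python sketch only illustrates that decision with placeholder numbers; the block size now comes from the cache manager rather than `DeviceConfig`.

```python
# Illustrative sketch only, not library code: the dynamic KV-block growth policy
# from the scheduler changes above.
def grow_kv_blocks(current_blocks, available_gpu_bytes, block_size_in_bytes, growth_factor=2):
    new_blocks = current_blocks * growth_factor
    required = (new_blocks - current_blocks) * block_size_in_bytes
    if required <= available_gpu_bytes:
        return new_blocks                                   # enough room: grow by the full factor
    extra = available_gpu_bytes // block_size_in_bytes
    return current_blocks + extra if extra > 0 else None    # None: cannot grow further

print(grow_kv_blocks(100, 512 * 2**20, 1_179_648))  # -> 200 with these placeholder numbers
```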
a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp index ddb3d0ae10..32d13feed1 100644 --- a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp @@ -5,8 +5,8 @@ #include "text_callback_streamer.hpp" #include "speculative_decoding_impl.hpp" +#include "paged_attention_transformations.hpp" #include "utils.hpp" -#include "utils/paged_attention_transformations.hpp" namespace ov::genai { @@ -35,6 +35,7 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); + utils::apply_gather_before_matmul_transformation(main_model); utils::apply_gather_before_matmul_transformation(draft_model); @@ -63,9 +64,7 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con ov::AnyMap draft_properties = draft_model_desc.properties.empty() ? main_model_desc.properties : draft_model_desc.properties; - ov::Core core = utils::singleton_core(); - DeviceConfig main_device_config(core, main_scheduler_config_updated, main_device, main_model_desc.properties), - draft_device_config(core, draft_scheduler_config, draft_device, draft_properties); + DeviceConfig main_device_config(main_device), draft_device_config(draft_device); utils::set_kv_cache_type_and_shape(main_model, main_device_config); utils::set_kv_cache_type_and_shape(draft_model, draft_device_config); @@ -81,10 +80,10 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con m_tokenizer = main_model_tokenizer; // to create `main_pipeline` with enabled validation_mode and `draft_pipeline` with disabled validation mode - m_main_pipeline = std::make_shared(core, + m_main_pipeline = std::make_shared( main_model, main_model_tokenizer, main_model_desc.generation_config, main_device_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); - m_draft_pipeline = std::make_shared(core, + m_draft_pipeline = std::make_shared( draft_model, draft_model_tokenizer, draft_model_desc.generation_config, draft_device_config, draft_scheduler_config, draft_device, draft_properties, false); } diff --git a/tests/cpp/CMakeLists.txt b/tests/cpp/CMakeLists.txt index d63ae17dcf..29e481cec3 100644 --- a/tests/cpp/CMakeLists.txt +++ b/tests/cpp/CMakeLists.txt @@ -20,15 +20,15 @@ file(GLOB src_files "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/sequence_group.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/sampler.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/speculative_decoding/*.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/prompt_lookup/*.cpp" - "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/utils/*.cpp" + "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/paged_attention_transformations.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/utils.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/continuous_batching*.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/icontinuous_batching.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/lora_helper.cpp" "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src/text_callback_streamer.cpp") -add_executable(${TEST_TARGET_NAME} ${tests_src} - block_allocator.cpp) +add_executable(${TEST_TARGET_NAME} ${tests_src}) + target_link_libraries(${TEST_TARGET_NAME} PRIVATE openvino::genai gtest_main gmock_main) 
target_include_directories(${TEST_TARGET_NAME} PRIVATE "${OpenVINOGenAI_SOURCE_DIR}/src/cpp/src") target_sources(${TEST_TARGET_NAME} PRIVATE ${src_files}) diff --git a/tests/cpp/cache_manager.cpp b/tests/cpp/cache_manager.cpp index 0c483f0ec1..864a7b43af 100644 --- a/tests/cpp/cache_manager.cpp +++ b/tests/cpp/cache_manager.cpp @@ -7,37 +7,13 @@ #include "scheduler.hpp" #include "device_config.hpp" #include "cache_manager.hpp" -#include "openvino/op/concat.hpp" +#include "helper.hpp" using namespace ov::genai; -std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers) { - ov::NodeVector keys; - ov::NodeVector values; - ov::ParameterVector params; - ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision); - ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; - - auto shape = ov::PartialShape({ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic()}); - for (size_t i = 0; i < num_layers; i++) { - auto key = std::make_shared(kv_cache_type, shape); - auto value = std::make_shared(kv_cache_type, shape); - key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); - value->get_output_tensor(0).set_names({"value_cache." + std::to_string(i)}); - keys.push_back(key); - values.push_back(value); - params.push_back(key); - params.push_back(value); - } - const auto& concat1 = std::make_shared(keys, 1); - const auto& concat2 = std::make_shared(values, 1); - auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); - return std::make_shared(ov::NodeVector{concat1, concat2}, params); -} - -size_t get_total_allocated_bytes(std::shared_ptr cache_manager, size_t num_decoder_layers) { +size_t get_total_allocated_bytes(std::shared_ptr cache_manager) { size_t allocated_bytes = 0; - for (size_t i = 0; i < num_decoder_layers; i++) { + for (size_t i = 0; i < cache_manager->get_num_decoder_layers(); i++) { auto key_cache = cache_manager->get_key_cache(i); auto value_cache = cache_manager->get_value_cache(i); allocated_bytes += key_cache.get_byte_size() + value_cache.get_byte_size(); @@ -45,93 +21,98 @@ size_t get_total_allocated_bytes(std::shared_ptr cache_ return allocated_bytes; } +size_t get_num_kv_blocks(size_t cache_size, size_t block_size_bytes) { + size_t kv_cache_size_in_bytes = cache_size * 1024 * 1024 * 1024; // convert GBs to bytes + return kv_cache_size_in_bytes / block_size_bytes; +} TEST(TestCacheManager, test_cache_size_param) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 0; scheduler_config.cache_size = 2; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); + auto cache_manager = std::make_shared(request, device_config); + ASSERT_EQ(num_decoder_layers, 
cache_manager->get_num_decoder_layers()); + const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, cache_manager->get_block_size_in_bytes()); + + auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - - ASSERT_EQ(get_total_allocated_bytes(cache_manager, num_decoder_layers), 2146959360); + + const size_t kv_cache_total_size = scheduler_config.cache_size * 1024 * 1024 * 1024; + const size_t cpu_block_size_total = cache_manager->get_block_size_in_bytes(); + size_t expected_size = kv_cache_total_size / cpu_block_size_total * cpu_block_size_total; + ASSERT_EQ(get_total_allocated_bytes(cache_manager), expected_size); } TEST(TestCacheManager, test_kv_blocks_param) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 150; scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); - ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); + auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, device_config.get_block_size(), num_decoder_layers); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); } TEST(TestCacheManager, test_dynamic_cache_increase) { ov::Core core; - ov::genai::SchedulerConfig scheduler_config; + SchedulerConfig scheduler_config; scheduler_config.max_num_batched_tokens = 32; scheduler_config.num_kv_blocks = 0; scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); device_config.set_kv_head_configs(kv_heads_config); - size_t block_size_in_bytes = 0; - for (size_t layer_id = 0; layer_id < num_decoder_layers; layer_id++) { - KVHeadConfig config = kv_heads_config[layer_id]; - block_size_in_bytes += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; - } - block_size_in_bytes *= device_config.get_block_size() * device_config.get_cache_precision().size(); - ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(device_config, request, core); - auto block_manager = BlockManager(device_config.get_num_kv_blocks(), false, device_config.get_block_size(), device_config.get_num_layers()); + auto cache_manager = std::make_shared(request, device_config); + size_t block_size_in_bytes = cache_manager->get_block_size_in_bytes(); + const size_t 
num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, block_size_in_bytes); + + auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); // check initial cache allocation block_manager.increase_kv_blocks_number(100); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), 100); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), 100); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 100 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 100 * block_size_in_bytes); // check cache increase block_manager.increase_kv_blocks_number(200); - OPENVINO_ASSERT(block_manager.get_total_number_of_kv_blocks(), 200); + ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), 200); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 200 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); // check that cache does not increase if new blocks were not allocated cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); - OPENVINO_ASSERT(get_total_allocated_bytes(cache_manager, num_decoder_layers), 200 * block_size_in_bytes); + ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); } \ No newline at end of file diff --git a/tests/cpp/device_config.cpp b/tests/cpp/device_config.cpp deleted file mode 100644 index a97037b1e8..0000000000 --- a/tests/cpp/device_config.cpp +++ /dev/null @@ -1,33 +0,0 @@ -// Copyright (C) 2018-2025 Intel Corporation -// SPDX-License-Identifier: Apache-2.0 -// - -#include -#include "openvino/runtime/core.hpp" -#include "scheduler.hpp" -#include "device_config.hpp" - -TEST(TestDeviceConfig, kv_cache_precision_u8) { - ov::Core core; - ov::genai::SchedulerConfig scheduler_config; - scheduler_config.max_num_batched_tokens = 32; - scheduler_config.num_kv_blocks = 0; - scheduler_config.cache_size = 2; - scheduler_config.max_num_seqs = 2; - - const std::string device = "CPU"; - size_t num_decoder_layers = 12; - size_t head_size = 64, head_size_u8 = head_size + 8; - - ov::genai::KVHeadConfig kv_head_config { 12, 12, head_size_u8, head_size_u8 }; - ov::genai::KVHeadConfig kv_head_config_u8 { 12, 12, head_size, head_size }; - - ov::genai::DeviceConfig device_config_default(core, scheduler_config, "CPU"); - ov::genai::DeviceConfig device_config_u8(core, scheduler_config, "CPU", { ov::hint::kv_cache_precision(ov::element::u8) }); - - device_config_default.set_kv_head_configs(std::vector(num_decoder_layers, kv_head_config)); - device_config_u8.set_kv_head_configs(std::vector(num_decoder_layers, kv_head_config_u8)); - - const auto ratio = ov::element::f16.size() / ov::element::u8.size(); - ASSERT_EQ(device_config_default.get_num_kv_blocks() * ratio, device_config_u8.get_num_kv_blocks()); -} diff --git a/tests/cpp/helper.cpp b/tests/cpp/helper.cpp new file mode 100644 index 0000000000..da242da479 --- /dev/null +++ b/tests/cpp/helper.cpp @@ -0,0 +1,27 @@ +// Copyright (C) 2023-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#include "helper.hpp" +#include "openvino/op/concat.hpp" + +std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers) { + 
ov::NodeVector keys, values; + ov::ParameterVector params; + ov::element::Type kv_cache_type = core.get_property("CPU", ov::hint::kv_cache_precision); + + auto shape = ov::PartialShape::dynamic(4); + for (size_t i = 0; i < num_layers; i++) { + auto key = std::make_shared(kv_cache_type, shape); + auto value = std::make_shared(kv_cache_type, shape); + key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); + value->get_output_tensor(0).set_names({"value_cache." + std::to_string(i)}); + keys.push_back(key); + values.push_back(value); + params.push_back(key); + params.push_back(value); + } + const auto& concat1 = std::make_shared(keys, 1); + const auto& concat2 = std::make_shared(values, 1); + auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); + return std::make_shared(ov::NodeVector{concat1, concat2}, params); +} diff --git a/tests/cpp/helper.hpp b/tests/cpp/helper.hpp new file mode 100644 index 0000000000..1fafe8bcf6 --- /dev/null +++ b/tests/cpp/helper.hpp @@ -0,0 +1,8 @@ +// Copyright (C) 2023-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 + +#pragma once + +#include "openvino/runtime/core.hpp" + +std::shared_ptr get_dummy_model(ov::Core core, size_t num_layers); \ No newline at end of file diff --git a/tests/cpp/scheduler.cpp b/tests/cpp/scheduler.cpp index 201318347a..b6aa5a9b53 100644 --- a/tests/cpp/scheduler.cpp +++ b/tests/cpp/scheduler.cpp @@ -9,6 +9,7 @@ #include "openvino/genai/generation_config.hpp" #include "sequence_group.hpp" #include "scheduler.hpp" +#include "helper.hpp" using namespace ov::genai; @@ -18,39 +19,16 @@ void clear_finished_sequences(std::vector& requests) { }); requests.erase(new_end, requests.end()); } -std::shared_ptr get_model(ov::Core core, size_t num_layers) { - ov::NodeVector keys; - ov::NodeVector values; - ov::ParameterVector params; - ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision); - ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16; - - auto shape = ov::PartialShape({ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic(), ov::Dimension::dynamic()}); - for (size_t i = 0; i < num_layers; i++) { - auto key = std::make_shared(kv_cache_type, shape); - auto value = std::make_shared(kv_cache_type, shape); - key->get_output_tensor(0).set_names({"key_cache." + std::to_string(i)}); - value->get_output_tensor(0).set_names({"value_cache." 
+ std::to_string(i)}); - keys.push_back(key); - values.push_back(value); - params.push_back(key); - params.push_back(value); - } - const auto& concat1 = std::make_shared(keys, 1); - const auto& concat2 = std::make_shared(values, 1); - auto model = std::make_shared(ov::NodeVector{concat1, concat2}, params); - return std::make_shared(ov::NodeVector{concat1, concat2}, params); -} std::shared_ptr init_cache_manager(SchedulerConfig scheduler_config) { ov::Core core = ov::Core(); size_t num_decoder_layers = 12; - ov::InferRequest request = core.compile_model(get_model(core, num_decoder_layers)).create_infer_request(); - size_t head_size = 64, head_size_u8 = head_size + 8; - std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size_u8, head_size_u8 }); - ov::genai::DeviceConfig device_config(core, scheduler_config, "CPU"); + ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); + const size_t head_size = 64; + std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size, head_size }); + ov::genai::DeviceConfig device_config("CPU"); device_config.set_kv_head_configs(kv_head_configs); - return std::make_shared(device_config, request, core); + return std::make_shared(request, device_config); } TEST(TestScheduler, general_test) { diff --git a/tests/cpp/speculative_decoding.cpp b/tests/cpp/speculative_decoding.cpp index 1cf8db0fab..114f16800b 100644 --- a/tests/cpp/speculative_decoding.cpp +++ b/tests/cpp/speculative_decoding.cpp @@ -13,8 +13,7 @@ class CBForSDTest : public testing::Test, public ov::genai::ContinuousBatchingPi m_sampler = std::make_shared(); }; - ov::genai::GenerationHandle - add_request(uint64_t request_id, const ov::Tensor& input_ids) { + ov::genai::GenerationHandle add_request(uint64_t request_id, const ov::Tensor& input_ids) { auto sampling_params = ov::genai::greedy(); sampling_params.num_assistant_tokens = 1; From e866ec088bc0a89f307509160401390f816373d3 Mon Sep 17 00:00:00 2001 From: Ekaterina Aidova Date: Wed, 29 Jan 2025 17:10:19 +0400 Subject: [PATCH 04/15] [LLM bench]support providing adapter config mode (#1644) CVS-161355 --- tools/llm_bench/benchmark.py | 1 + .../llm_bench/llm_bench_utils/model_utils.py | 1 + tools/llm_bench/llm_bench_utils/ov_utils.py | 19 +++++++++++++++---- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/tools/llm_bench/benchmark.py b/tools/llm_bench/benchmark.py index d01c6316ff..3a4079d6b6 100644 --- a/tools/llm_bench/benchmark.py +++ b/tools/llm_bench/benchmark.py @@ -140,6 +140,7 @@ def get_argprser(): default=None, help="Path to LoRA adapters for using OpenVINO GenAI optimized pipelines with LoRA for benchmarking") parser.add_argument('--lora_alphas', nargs='*', help='Alphas params for LoRA adapters.', required=False, default=[]) + parser.add_argument("--lora_mode", choices=["auto", "fuse", "static", "static_rank", "dynamic"], help="LoRA adapters loading mode") parser.add_argument("--use_cb", action="store_true", help="Use Continuous Batching inference mode") parser.add_argument("--cb_config", required=False, default=None, help="Path to file with Continuous Batching Scheduler settings or dict") parser.add_argument("--draft_model", required=False, default=None, diff --git a/tools/llm_bench/llm_bench_utils/model_utils.py b/tools/llm_bench/llm_bench_utils/model_utils.py index aaf72113dc..4bd696a569 100644 --- a/tools/llm_bench/llm_bench_utils/model_utils.py +++ b/tools/llm_bench/llm_bench_utils/model_utils.py @@ -131,6 +131,7 @@ def 
analyze_args(args): model_args['output_dir'] = args.output_dir model_args['lora'] = args.lora model_args['lora_alphas'] = args.lora_alphas + model_args['lora_mode'] = args.lora_mode use_cb = args.use_cb or args.draft_model if args.device == "NPU" and use_cb: log.warning("Continious batching and Speculative Decoding are not supported for NPU device") diff --git a/tools/llm_bench/llm_bench_utils/ov_utils.py b/tools/llm_bench/llm_bench_utils/ov_utils.py index c70e4beb5e..eea3dd50f3 100644 --- a/tools/llm_bench/llm_bench_utils/ov_utils.py +++ b/tools/llm_bench/llm_bench_utils/ov_utils.py @@ -135,9 +135,20 @@ def decode_ov_tokenizer(self, token_ids, *args, **kwargs): return hf_tokenizer -def get_lora_config(lora_paths, lora_alphas): +def get_lora_config(lora_paths, lora_alphas, lora_mode=None): import openvino_genai + modes = { + "auto": openvino_genai.AdapterConfig.Mode.MODE_AUTO, + "fuse": openvino_genai.AdapterConfig.Mode.MODE_FUSE, + "dynamic": openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC, + "static": openvino_genai.AdapterConfig.Mode.MODE_STATIC, + "static_rank": openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC + } + if lora_mode is not None: + lora_mode = modes[lora_mode] + log.info(f"LoRA adapters loading mode: {lora_mode}") + adapter_config = list() if not lora_paths: return adapter_config @@ -150,7 +161,7 @@ def get_lora_config(lora_paths, lora_alphas): if not Path(lora_paths[idx]).exists(): log.warning(f'LoRA path is not exists: {lora_paths[idx]}. LoRA will be ignored.') continue - adapter_config = openvino_genai.AdapterConfig() + adapter_config = openvino_genai.AdapterConfig() if lora_mode is None else openvino_genai.AdapterConfig(mode=lora_mode) adapter = openvino_genai.Adapter(lora_paths[idx]) alpha = float(lora_alphas[idx]) adapter_config.add(adapter, alpha) @@ -263,7 +274,7 @@ def create_genai_text_gen_model(model_path, device, ov_config, **kwargs): if kwargs.get("draft_cb_config") is not None else {} ov_config['draft_model'] = openvino_genai.draft_model(draft_model_path, draft_device.upper(), **draft_model_load_kwargs) - adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", [])) + adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", []), kwargs.get("lora_mode", None)) if adapter_config: ov_config['adapters'] = adapter_config @@ -413,7 +424,7 @@ def get_unet_step_count(self): def get_vae_decoder_step_count(self): return 1 - adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", [])) + adapter_config = get_lora_config(kwargs.get("lora", None), kwargs.get("lora_alphas", []), kwargs.get("lora_mode", None)) if adapter_config: ov_config['adapters'] = adapter_config From 106e56126be652e18998762d05eafb2aa681315d Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Wed, 29 Jan 2025 17:41:29 +0400 Subject: [PATCH 05/15] [WWB]: Fixed chat template usage in VLM GenAI pipeline (#1643) Co-authored-by: Ilya Lavrenov --- tools/who_what_benchmark/whowhatbench/wwb.py | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 7d4354f846..1eb778a060 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -337,11 +337,9 @@ def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_to config.max_new_tokens = max_new_tokens config.do_sample = False model.set_generation_config(config) - if tokenizer.chat_template is not None: 
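A minimal sketch of what the new `--lora_mode` plumbing above amounts to at the `openvino_genai` level, mirroring the `get_lora_config()` change; the adapter path and alpha value are placeholders.

```python
# Minimal sketch mirroring the get_lora_config() change above; path and alpha are placeholders.
import openvino_genai

mode = openvino_genai.AdapterConfig.Mode.MODE_DYNAMIC         # selected via --lora_mode dynamic
adapter_config = openvino_genai.AdapterConfig(mode=mode)
adapter_config.add(openvino_genai.Adapter("adapter.safetensors"), 0.8)

ov_config = {"adapters": adapter_config}   # later forwarded to the GenAI pipeline constructor
```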
- model.start_chat(tokenizer.chat_template) - else: - model.start_chat() - out = model.generate(prompt, images=[image_data]) + + model.start_chat() + out = model.generate(prompt, image=image_data) model.finish_chat() return out.texts[0] From 020bdabcb6a9fab889f97c43cf59122a9fdf9c88 Mon Sep 17 00:00:00 2001 From: Sofya Balandina Date: Wed, 29 Jan 2025 13:53:56 +0000 Subject: [PATCH 06/15] Automatically apply chat template in non-chat scenarios (#1533) [CVS-157276](https://jira.devtools.intel.com/browse/CVS-157276) --- .github/workflows/causal_lm_cpp.yml | 48 ++++++++++++++----- README.md | 1 - samples/cpp/text_generation/README.md | 2 +- samples/python/text_generation/README.md | 2 +- src/README.md | 2 + .../openvino/genai/generation_config.hpp | 7 +++ .../include/openvino/genai/llm_pipeline.hpp | 4 ++ src/cpp/include/openvino/genai/tokenizer.hpp | 3 ++ .../genai/visual_language/pipeline.hpp | 8 ++++ .../genai/whisper_generation_config.hpp | 2 +- src/cpp/src/generation_config.cpp | 1 + src/cpp/src/icontinuous_batching.cpp | 16 ++++++- src/cpp/src/llm_pipeline_stateful.cpp | 26 ++++++++-- src/cpp/src/llm_pipeline_static.cpp | 20 +++++++- src/cpp/src/tokenizer.cpp | 8 ++++ .../src/visual_language/inputs_embedder.cpp | 29 ++++++++++- .../src/visual_language/inputs_embedder.hpp | 3 ++ src/cpp/src/visual_language/pipeline.cpp | 2 + src/cpp/src/whisper_generation_config.cpp | 6 +++ .../openvino_genai/py_openvino_genai.pyi | 5 ++ src/python/py_generation_config.cpp | 2 + src/python/py_tokenizer.cpp | 6 +++ tests/python_tests/common.py | 18 +++++-- tests/python_tests/test_generation_config.py | 2 + tests/python_tests/test_llm_pipeline.py | 8 ++-- tests/python_tests/test_sampling.py | 4 +- tools/llm_bench/task/text_generation.py | 2 + .../task/visual_language_generation.py | 1 + tools/who_what_benchmark/whowhatbench/wwb.py | 3 +- 29 files changed, 207 insertions(+), 34 deletions(-) diff --git a/.github/workflows/causal_lm_cpp.yml b/.github/workflows/causal_lm_cpp.yml index 2e0afaa882..5dff0a58d3 100644 --- a/.github/workflows/causal_lm_cpp.yml +++ b/.github/workflows/causal_lm_cpp.yml @@ -120,7 +120,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('Why is the Sun yellow?', return_tensors='pt') + prompt = 'Why is the Sun yellow?' 
+ if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -136,7 +139,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('69', return_tensors='pt') + prompt = '69' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -152,7 +158,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('Hi', return_tensors='pt') + prompt = 'Hi' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -168,7 +177,10 @@ jobs: with open('pred.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('return 0', return_tensors='pt') + prompt = 'return 0' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -184,7 +196,10 @@ jobs: with open('pred.txt', 'r', errors='ignore') as file: predictions = file.read() tokenizer = 
transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') - tokenized = tokenizer('你好! 你好嗎?', return_tensors='pt') + prompt = '你好! 你好嗎?' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref.replace('�', '')) @@ -194,19 +209,21 @@ jobs: " echo "你好! 你好嗎?" passed - timeout 1m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Alan Turing was a" "return 0" "你好! 你好嗎?" > ./pred.txt + timeout 1m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Why is the Sun yellow?" "return 0" "你好! 你好嗎?" > ./pred.txt python -c " import transformers with open('pred.txt', 'r', errors='ignore') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0') prompts = [ - 'Alan Turing was a', + 'Why is the Sun yellow?', 'return 0', '你好! 你好嗎?' ] for prompt in prompts: - tokenized = tokenizer(prompt, return_tensors='pt') + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for beam in transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0').generate(**tokenized, num_beam_groups=3, num_beams=15, num_return_sequences=15, diversity_penalty=1.0, max_new_tokens=20, early_stopping=False, length_penalty=1.0, no_repeat_ngram_size=9**9, do_sample=False): ref = ': ' + tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref.replace('�', '')) @@ -255,7 +272,10 @@ jobs: echo import transformers > ref.py echo predictions = open('cpp.txt', 'r').read() >> ref.py echo tokenizer = transformers.AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', trust_remote_code=True) >> ref.py - echo tokenized = tokenizer('69', return_tensors='pt') >> ref.py + echo prompt = '69' >> ref.py + echo if tokenizer.chat_template: >> ref.py + echo prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) >> ref.py + echo tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) >> ref.py echo for beam in transformers.AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', trust_remote_code=True).generate(**tokenized, max_new_tokens=100, do_sample=False): >> ref.py echo ref = tokenizer.decode(beam[tokenized['input_ids'].numel():], skip_special_tokens=True) >> ref.py echo idx = predictions.find(ref) >> ref.py @@ -562,7 +582,10 @@ jobs: with open('pred_greedy.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/phi-1_5') - tokenized = tokenizer('Alan Turing was a', return_tensors='pt') + prompt = 'Alan Turing was a' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, 
add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for output in transformers.AutoModelForCausalLM.from_pretrained('microsoft/phi-1_5').generate(**tokenized, max_length=100, do_sample=False): ref = tokenizer.decode(output[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) @@ -617,7 +640,10 @@ jobs: with open('pred_greedy.txt', 'r') as file: predictions = file.read() tokenizer = transformers.AutoTokenizer.from_pretrained('ikala/redpajama-3b-chat') - tokenized = tokenizer('Alan Turing was a', return_tensors='pt') + prompt = 'Alan Turing was a' + if tokenizer.chat_template: + prompt = tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + tokenized = tokenizer(prompt, return_tensors='pt', add_special_tokens=False) for output in transformers.AutoModelForCausalLM.from_pretrained('ikala/redpajama-3b-chat').generate(**tokenized, max_length=100, do_sample=False): ref = tokenizer.decode(output[tokenized['input_ids'].numel():], skip_special_tokens=True) idx = predictions.find(ref) diff --git a/README.md b/README.md index cea1e358bc..221a81c6c3 100644 --- a/README.md +++ b/README.md @@ -133,7 +133,6 @@ from PIL import Image # Choose GPU instead of CPU in the line below to run the model on Intel integrated or discrete GPU pipe = openvino_genai.VLMPipeline("./InternVL2-1B", "CPU") -pipe.start_chat() image = Image.open("dog.jpg") image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8) diff --git a/samples/cpp/text_generation/README.md b/samples/cpp/text_generation/README.md index dd24b6ebf5..d20d8ac09d 100644 --- a/samples/cpp/text_generation/README.md +++ b/samples/cpp/text_generation/README.md @@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat ./chat_sample ``` #### Missing chat template -If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model. +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model or update it using call `pipe.get_tokenizer().set_chat_template(new_chat_template)`. The following template can be used as a default, but it may not work properly with every model: ``` "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", diff --git a/samples/python/text_generation/README.md b/samples/python/text_generation/README.md index 97a6ad59bc..6b086f3471 100644 --- a/samples/python/text_generation/README.md +++ b/samples/python/text_generation/README.md @@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat python chat_sample.py model_dir ``` #### Missing chat template -If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. 
To work this around, manually add the chat template to tokenizer_config.json of your model. +If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work this around, manually add the chat template to tokenizer_config.json of your model or update it using call `pipe.get_tokenizer().set_chat_template(new_chat_template)`. The following template can be used as a default, but it may not work properly with every model: ``` "chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}", diff --git a/src/README.md b/src/README.md index af4953f98a..c2ed8c2a60 100644 --- a/src/README.md +++ b/src/README.md @@ -73,6 +73,8 @@ output: 'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in' ``` +>**Note**: The chat_template from tokenizer_config.json or from tokenizer/detokenizer model will be automatically applied to the prompt at the generation stage. If you want to disable it, you can do it by calling pipe.get_tokenizer().set_chat_template(""). + A simple chat in Python: ```python import openvino_genai as ov_genai diff --git a/src/cpp/include/openvino/genai/generation_config.hpp b/src/cpp/include/openvino/genai/generation_config.hpp index 3a75fc02ea..13cc8f0b01 100644 --- a/src/cpp/include/openvino/genai/generation_config.hpp +++ b/src/cpp/include/openvino/genai/generation_config.hpp @@ -77,6 +77,8 @@ enum class StopCriteria { EARLY, HEURISTIC, NEVER }; * @param assistant_confidence_threshold the lower token probability of candidate to be validated by main model in case of dynamic strategy candidates number update. * @param num_assistant_tokens the defined candidates number to be generated by draft model/prompt lookup in case of static strategy candidates number update. * @param max_ngram_size is maximum ngram to use when looking for matches in the prompt. + * + * @param apply_chat_template whether or not to apply chat_template for non-chat scenarios */ class OPENVINO_GENAI_EXPORTS GenerationConfig { @@ -128,6 +130,9 @@ class OPENVINO_GENAI_EXPORTS GenerationConfig { std::optional adapters; + // set to true if chat template should be applied for non-chat scenarios, set to false otherwise + bool apply_chat_template = true; + /** @brief sets eos_token_id to tokenizer_eos_token_id if eos_token_id is less than 0. * Otherwise verifies eos_token_id == tokenizer_eos_token_id. */ @@ -189,6 +194,8 @@ extern OPENVINO_GENAI_EXPORTS ov::Property rng_seed; static constexpr ov::Property assistant_confidence_threshold{"assistant_confidence_threshold"}; static constexpr ov::Property num_assistant_tokens{"num_assistant_tokens"}; +static constexpr ov::Property apply_chat_template{"apply_chat_template"}; + // Predefined Configs OPENVINO_DEPRECATED("Please, use individual parameters instead of predefined configs. 
This method will be removed in 2026.0.0 release") diff --git a/src/cpp/include/openvino/genai/llm_pipeline.hpp b/src/cpp/include/openvino/genai/llm_pipeline.hpp index 31b1ac1675..26232574dc 100644 --- a/src/cpp/include/openvino/genai/llm_pipeline.hpp +++ b/src/cpp/include/openvino/genai/llm_pipeline.hpp @@ -177,6 +177,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline { * @param generation_config optional GenerationConfig * @param streamer optional streamer * @return DecodedResults decoded resulting text + * chat_template will be applied to the prompt, run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it. + * To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. */ DecodedResults generate( StringInputs inputs, @@ -191,6 +193,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline { * @param inputs input prompt or a vector of prompts * @param properties properties * @return DecodedResults decoded resulting text + * chat_template will be applied to the prompt, run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it. + * To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. */ template util::EnableIfAllStringAny generate( diff --git a/src/cpp/include/openvino/genai/tokenizer.hpp b/src/cpp/include/openvino/genai/tokenizer.hpp index 0a54d1da2a..bde4eb3fe1 100644 --- a/src/cpp/include/openvino/genai/tokenizer.hpp +++ b/src/cpp/include/openvino/genai/tokenizer.hpp @@ -221,6 +221,9 @@ class OPENVINO_GENAI_EXPORTS Tokenizer { /// @param chat_template The new template to override with. void set_chat_template(const std::string& chat_template); + // get information about a chat template to check its status, for example whether it is empty + std::string get_chat_template() const; + // information about , tokens should be public, // they are used at least in StreamerBase descendants int64_t get_bos_token_id() const; diff --git a/src/cpp/include/openvino/genai/visual_language/pipeline.hpp b/src/cpp/include/openvino/genai/visual_language/pipeline.hpp index 8c3d380b0f..b6b1d5c7f6 100644 --- a/src/cpp/include/openvino/genai/visual_language/pipeline.hpp +++ b/src/cpp/include/openvino/genai/visual_language/pipeline.hpp @@ -98,6 +98,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param generation_config A config to follow for text generation. /// @param streamer A streamer to acquire intermediate result. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. VLMDecodedResults generate( const std::string& prompt, const std::vector& rgbs, @@ -111,6 +113,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param generation_config A config to follow for text generation. /// @param streamer A streamer to acquire intermediate result. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. 
VLMDecodedResults generate( const std::string& prompt, const ov::Tensor& rgb, @@ -124,6 +128,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// for its members, StreamerVariant a single image or multiple /// images. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. VLMDecodedResults generate( const std::string& prompt, const ov::AnyMap& config_map @@ -137,6 +143,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline { /// @param ...properties ov::Property instances to be combined into /// ov::AnyMap. /// @return A string generated by a model. + /// chat_template will be applied to the prompt, run pipe.set_chat_template(custom_chat_template) to update it. + /// To disable it for non-chat mode, please, use custom_chat_template eq "" or set generation_config.apply_chat_template to false. template util::EnableIfAllStringAny generate( const std::string& prompt, diff --git a/src/cpp/include/openvino/genai/whisper_generation_config.hpp b/src/cpp/include/openvino/genai/whisper_generation_config.hpp index 18b4202609..db92f2bcc4 100644 --- a/src/cpp/include/openvino/genai/whisper_generation_config.hpp +++ b/src/cpp/include/openvino/genai/whisper_generation_config.hpp @@ -18,7 +18,7 @@ namespace genai { */ class OPENVINO_GENAI_EXPORTS WhisperGenerationConfig : public GenerationConfig { public: - WhisperGenerationConfig() = default; + WhisperGenerationConfig(); explicit WhisperGenerationConfig(const std::filesystem::path& json_path); // Corresponds to the ”<|startoftranscript|>” token. diff --git a/src/cpp/src/generation_config.cpp b/src/cpp/src/generation_config.cpp index de23852c9b..3914e217c4 100644 --- a/src/cpp/src/generation_config.cpp +++ b/src/cpp/src/generation_config.cpp @@ -128,6 +128,7 @@ void GenerationConfig::update_generation_config(const ov::AnyMap& properties) { read_anymap_param(properties, "logprobs", logprobs); read_anymap_param(properties, "num_return_sequences", num_return_sequences); read_anymap_param(properties, "adapters", adapters); + read_anymap_param(properties, "apply_chat_template", apply_chat_template); // penalties read_anymap_param(properties, "frequency_penalty", frequency_penalty); diff --git a/src/cpp/src/icontinuous_batching.cpp b/src/cpp/src/icontinuous_batching.cpp index 78f8fda8f7..5bdf00d51d 100644 --- a/src/cpp/src/icontinuous_batching.cpp +++ b/src/cpp/src/icontinuous_batching.cpp @@ -53,9 +53,21 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate( } else { input_ids.reserve(prompts.size()); timer.start(); - for (const std::string& prompt : prompts) { + for (size_t i = 0; i < prompts.size(); i++) { + const std::string& prompt = prompts.at(i); const auto encode_start = std::chrono::steady_clock::now(); - input_ids.push_back(m_tokenizer.encode(prompt).input_ids); + ov::Tensor encoded_inputs; + if (sampling_params.at(i).apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + encoded_inputs = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)).input_ids; + } else { + // in case when chat_template was not found in tokenizer_config.json or set + std::string input_str(prompt); + 
encoded_inputs = m_tokenizer.encode(input_str, ov::genai::add_special_tokens(true)).input_ids; + } + input_ids.push_back(encoded_inputs); tokenization_durations.emplace_back(PerfMetrics::get_microsec(std::chrono::steady_clock::now() - encode_start)); } timer.end(); diff --git a/src/cpp/src/llm_pipeline_stateful.cpp b/src/cpp/src/llm_pipeline_stateful.cpp index 2a53154c27..0dea53c7ed 100644 --- a/src/cpp/src/llm_pipeline_stateful.cpp +++ b/src/cpp/src/llm_pipeline_stateful.cpp @@ -88,7 +88,18 @@ DecodedResults StatefulLLMPipeline::generate( if (auto input_vector = std::get_if>(&inputs)) { OPENVINO_ASSERT(!is_chat_conversation, "Can't chat with multiple prompts"); - encoded_input = m_tokenizer.encode(*input_vector); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + std::vector templated_input_vector; + for (auto& input : *input_vector) { + ChatHistory history({{{"role", "user"}, {"content", input}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + templated_input_vector.push_back(templated_prompt); + } + encoded_input = m_tokenizer.encode(templated_input_vector, ov::genai::add_special_tokens(false)); + } else { + encoded_input = m_tokenizer.encode(*input_vector, ov::genai::add_special_tokens(true)); + } } else if (auto input_prompt = std::get_if(&inputs)) { std::string& prompt = *input_prompt; @@ -104,7 +115,7 @@ DecodedResults StatefulLLMPipeline::generate( m_history.push_back({{"role", "user"}, {"content", prompt}}); constexpr bool add_generation_prompt = true; - auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); + auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); // Do not add special tokens in chat scenario to be aligned with HF. 
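The prompt-preparation branching added above recurs across the continuous-batching, stateful and static pipelines. The compact Python sketch below restates the same logic using the `openvino_genai` Tokenizer bindings; it is illustrative only, the helper name is hypothetical, and it assumes the `chat_template` property exposed by this patch together with the `add_special_tokens` encode option.

```python
# Illustrative sketch of the branching added in the pipelines above; not library code.
def encode_for_generation(tokenizer, prompt, apply_chat_template=True):
    if apply_chat_template and tokenizer.chat_template:
        templated = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}], add_generation_prompt=True)
        # the template already inserts the special tokens, so do not add them again
        return tokenizer.encode(templated, add_special_tokens=False)
    # no chat template found in tokenizer_config.json (or templating disabled)
    return tokenizer.encode(prompt, add_special_tokens=True)
```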
auto new_chat_tokens = m_tokenizer.encode(new_templated_chat_history, ov::genai::add_special_tokens(false)); auto prev_chat_tokens = m_tokenizer.encode(m_templated_chat_history, ov::genai::add_special_tokens(false)); @@ -157,7 +168,16 @@ DecodedResults StatefulLLMPipeline::generate( // TODO: Forbid LoRA config change if we are in the chat mode, because it requires regenerating the history with LoRA applied } else { - encoded_input = m_tokenizer.encode(prompt); + std::string& prompt = *input_prompt; + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + encoded_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + encoded_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } } diff --git a/src/cpp/src/llm_pipeline_static.cpp b/src/cpp/src/llm_pipeline_static.cpp index b17ee959c5..47426d1cf4 100644 --- a/src/cpp/src/llm_pipeline_static.cpp +++ b/src/cpp/src/llm_pipeline_static.cpp @@ -827,7 +827,15 @@ DecodedResults StatefulLLMPipeline::generate( // for chat ov::genai::add_special_tokens(false) is aligned with stateful pipeline and HF tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(false)); } else { - tokenized_input = m_tokenizer.encode(prompt); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + tokenized_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } auto encode_stop_time = std::chrono::steady_clock::now(); @@ -1294,7 +1302,15 @@ DecodedResults StatelessLLMPipeline::generate( // for chat ov::genai::add_special_tokens(false) is aligned with stateful pipeline and HF tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(false)); } else { - tokenized_input = m_tokenizer.encode(prompt); + if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) { + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + tokenized_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)); + } else { + // in case when chat_template was not found in tokenizer_config.json or set + tokenized_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true)); + } } auto encode_stop_time = std::chrono::steady_clock::now(); diff --git a/src/cpp/src/tokenizer.cpp b/src/cpp/src/tokenizer.cpp index 9676cdb5f3..2eadda53ba 100644 --- a/src/cpp/src/tokenizer.cpp +++ b/src/cpp/src/tokenizer.cpp @@ -573,6 +573,10 @@ class Tokenizer::TokenizerImpl { void set_chat_template(const std::string& chat_template) { m_chat_template = patch_chat_template(chat_template); } + + std::string get_chat_template() { + return m_chat_template; + } }; Tokenizer::Tokenizer(const 
std::filesystem::path& tokenizer_path, const ov::AnyMap& properties) { @@ -676,6 +680,10 @@ std::string Tokenizer::apply_chat_template(ChatHistory history, return m_pimpl->apply_chat_template(history, add_generation_prompt, chat_template); } +std::string Tokenizer::get_chat_template() const { + return m_pimpl->get_chat_template(); +} + void Tokenizer::set_chat_template(const std::string& chat_template) { m_pimpl->set_chat_template(chat_template); } diff --git a/src/cpp/src/visual_language/inputs_embedder.cpp b/src/cpp/src/visual_language/inputs_embedder.cpp index 66b17e5804..e912570f20 100644 --- a/src/cpp/src/visual_language/inputs_embedder.cpp +++ b/src/cpp/src/visual_language/inputs_embedder.cpp @@ -43,6 +43,8 @@ class InputsEmbedder::IInputsEmbedder { // If we use beam search sampling with chat mode we need to remove last answer of the model from kv cache and add best answer to history // so, let's keep info about amount of tokens to trim from kv cache and amount of tokens to keep in history ov::genai::utils::HistoryRemoveManager m_kv_history_manager = {0, 0}; + // True if chat template should be applied for non-chat scenario + bool m_apply_chat_template = true; public: virtual ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector& images, ov::genai::VLMPerfMetrics& metrics) = 0; @@ -82,6 +84,10 @@ class InputsEmbedder::IInputsEmbedder { std::copy(encoded_result.begin(), encoded_result.end(), std::back_inserter(m_tokenized_history)); } + void set_apply_chat_template_status(bool apply_chat_template) { + m_apply_chat_template = apply_chat_template; + } + virtual void start_chat(const std::string& system_message) { m_is_chat_conversation = true; m_kv_history_manager.reset(); @@ -155,7 +161,7 @@ class InputsEmbedder::IInputsEmbedder { m_history.push_back({{"role", "user"}, {"content", prompt}}); constexpr bool add_generation_prompt = true; std::string new_templated_chat_history; - try { + try { new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt); } catch (const std::exception& error) { // Use fallback chat template if it was not found in tokenizer_config.json @@ -169,8 +175,23 @@ class InputsEmbedder::IInputsEmbedder { m_templated_chat_history = std::move(new_templated_chat_history); return {new_chat_tokens, prev_chat_tokens}; } else { + ov::Tensor encoded_input_ids; auto start_tokenizer_time = std::chrono::steady_clock::now(); - ov::Tensor encoded_input_ids = m_tokenizer.encode(prompt).input_ids; + if (m_apply_chat_template) { + std::string templated_prompt; + ChatHistory history({{{"role", "user"}, {"content", prompt}}}); + constexpr bool add_generation_prompt = true; + + if (!m_tokenizer.get_chat_template().empty()) { + templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt); + } else { + // Use fallback chat template if it was not found in tokenizer_config.json + templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt, chat_template_fallback); + } + encoded_input_ids = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)).input_ids; + } else { + encoded_input_ids = m_tokenizer.encode(prompt).input_ids; + } auto end_tokenizer_time = std::chrono::steady_clock::now(); metrics.raw_metrics.tokenization_durations.emplace_back(PerfMetrics::get_microsec(end_tokenizer_time - start_tokenizer_time)); return {encoded_input_ids, ov::Tensor()}; @@ -2046,6 +2067,10 @@ void InputsEmbedder::update_chat_history(const std::string& decoded_results) { return 
m_impl->update_chat_history(decoded_results); } +void InputsEmbedder::set_apply_chat_template_status(bool apply_chat_template) { + return m_impl->set_apply_chat_template_status(apply_chat_template); +} + void InputsEmbedder::finish_chat() { return m_impl->finish_chat(); } diff --git a/src/cpp/src/visual_language/inputs_embedder.hpp b/src/cpp/src/visual_language/inputs_embedder.hpp index 4462c58185..5bd7cd3004 100644 --- a/src/cpp/src/visual_language/inputs_embedder.hpp +++ b/src/cpp/src/visual_language/inputs_embedder.hpp @@ -58,6 +58,9 @@ class InputsEmbedder { // adds currently generated text to chat history void update_chat_history(const std::string& decoded_results); + // set the apply_chat_template flag, which determines whether chat template should be applied for non-chat scenarios + void set_apply_chat_template_status(bool apply_chat_template); + // finishes chat and clears a chat history void finish_chat(); private: diff --git a/src/cpp/src/visual_language/pipeline.cpp b/src/cpp/src/visual_language/pipeline.cpp index 95e3064548..a3f9859384 100644 --- a/src/cpp/src/visual_language/pipeline.cpp +++ b/src/cpp/src/visual_language/pipeline.cpp @@ -165,6 +165,8 @@ class ov::genai::VLMPipeline::VLMPipelineImpl { generation_config.set_eos_token_id(m_generation_config.eos_token_id); generation_config.validate(); + m_inputs_embedder->set_apply_chat_template_status(generation_config.apply_chat_template); + auto start_get_inputs_embeds = std::chrono::steady_clock::now(); ov::Tensor inputs_embeds = m_inputs_embedder->get_inputs_embeds(prompt, rgbs, perf_metrics); auto end_get_inputs_embeds = std::chrono::steady_clock::now(); diff --git a/src/cpp/src/whisper_generation_config.cpp b/src/cpp/src/whisper_generation_config.cpp index ec12170cf9..64bcd3e359 100644 --- a/src/cpp/src/whisper_generation_config.cpp +++ b/src/cpp/src/whisper_generation_config.cpp @@ -14,6 +14,10 @@ namespace ov { namespace genai { +WhisperGenerationConfig::WhisperGenerationConfig() { + apply_chat_template = false; +} + WhisperGenerationConfig::WhisperGenerationConfig(const std::filesystem::path& json_path) : GenerationConfig::GenerationConfig(json_path) { using ov::genai::utils::read_json_param; @@ -38,6 +42,8 @@ WhisperGenerationConfig::WhisperGenerationConfig(const std::filesystem::path& js } read_json_param(data, "lang_to_id", lang_to_id); + + apply_chat_template = false; } void WhisperGenerationConfig::update_generation_config(const ov::AnyMap& config_map) { diff --git a/src/python/openvino_genai/py_openvino_genai.pyi b/src/python/openvino_genai/py_openvino_genai.pyi index f1898d1232..62f3fb6060 100644 --- a/src/python/openvino_genai/py_openvino_genai.pyi +++ b/src/python/openvino_genai/py_openvino_genai.pyi @@ -550,6 +550,7 @@ class GenerationConfig: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -578,6 +579,7 @@ class GenerationConfig: num_return_sequences: the number of sequences to generate from a single prompt. 
""" adapters: AdapterConfig | None + apply_chat_template: bool assistant_confidence_threshold: float diversity_penalty: float do_sample: bool @@ -996,6 +998,7 @@ class LLMPipeline: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -1081,6 +1084,7 @@ class LLMPipeline: echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -1653,6 +1657,7 @@ class Tokenizer: openvino_genai.Tokenizer object is used to initialize Tokenizer if it's located in a different path than the main model. """ + chat_template: str def __init__(self, tokenizer_path: os.PathLike, properties: dict[str, typing.Any] = {}, **kwargs) -> None: ... def apply_chat_template(self, history: list[dict[str, str]], add_generation_prompt: bool, chat_template: str = '') -> str: diff --git a/src/python/py_generation_config.cpp b/src/python/py_generation_config.cpp index e2a6d7062c..a2c77589db 100644 --- a/src/python/py_generation_config.cpp +++ b/src/python/py_generation_config.cpp @@ -47,6 +47,7 @@ char generation_config_docstring[] = R"( echo: if set to true, the model will echo the prompt in the output. logprobs: number of top logprobs computed for each position, if set to 0, logprobs are not computed and value 0.0 is returned. Currently only single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1. (default: 0). + apply_chat_template: whether to apply chat_template for non-chat scenarios repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty. presence_penalty: reduces absolute log prob if the token was generated at least once. @@ -115,6 +116,7 @@ void init_generation_config(py::module_& m) { .def_readwrite("include_stop_str_in_output", &GenerationConfig::include_stop_str_in_output) .def_readwrite("stop_token_ids", &GenerationConfig::stop_token_ids) .def_readwrite("adapters", &GenerationConfig::adapters) + .def_readwrite("apply_chat_template", &GenerationConfig::apply_chat_template) .def("set_eos_token_id", &GenerationConfig::set_eos_token_id, py::arg("tokenizer_eos_token_id")) .def("is_beam_search", &GenerationConfig::is_beam_search) .def("is_greedy_decoding", &GenerationConfig::is_greedy_decoding) diff --git a/src/python/py_tokenizer.cpp b/src/python/py_tokenizer.cpp index 0dd9f3d715..5d8640b9d5 100644 --- a/src/python/py_tokenizer.cpp +++ b/src/python/py_tokenizer.cpp @@ -109,6 +109,12 @@ void init_tokenizer(py::module_& m) { "Override a chat_template read from tokenizer_config.json." 
) + .def_property( + "chat_template", + &Tokenizer::get_chat_template, + &Tokenizer::set_chat_template + ) + .def("get_pad_token_id", &Tokenizer::get_pad_token_id) .def("get_bos_token_id", &Tokenizer::get_bos_token_id) .def("get_eos_token_id", &Tokenizer::get_eos_token_id) diff --git a/tests/python_tests/common.py b/tests/python_tests/common.py index b0b6a70e93..320f1e1a6a 100644 --- a/tests/python_tests/common.py +++ b/tests/python_tests/common.py @@ -252,7 +252,12 @@ def run_hugging_face( # process prompt by promp as we have multiple generation configs for prompt, generation_config in zip(prompts, generation_configs): hf_generation_config = convert_to_hf(opt_model.generation_config, generation_config) - inputs = hf_tokenizer(prompt, return_tensors="pt") + inputs = {} + if hf_tokenizer.chat_template and generation_config.apply_chat_template: + prompt = hf_tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True) + inputs = hf_tokenizer(prompt, return_tensors="pt", add_special_tokens=False) + else: + inputs = hf_tokenizer(prompt, return_tensors="pt") input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask'] prompt_len = 0 if generation_config.echo else input_ids.numel() @@ -266,8 +271,15 @@ def run_hugging_face( generation_result.m_scores = [score for score in generate_outputs.sequences_scores] generation_results.append(generation_result) else: - # process all prompts as a single batch as we have a single generation config for all prompts - inputs = hf_tokenizer(prompts, return_tensors='pt', padding=True, truncation=True, add_special_tokens=True, padding_side='left') + inputs = {} + if hf_tokenizer.chat_template and generation_configs.apply_chat_template: + processed_prompts = [] + for prompt in prompts: + processed_prompts.append(hf_tokenizer.apply_chat_template([{'role': 'user', 'content': prompt}], tokenize=False, add_generation_prompt=True)) + # process all prompts as a single batch as we have a single generation config for all prompts + inputs = hf_tokenizer(processed_prompts, return_tensors='pt', padding=True, truncation=True, add_special_tokens=False, padding_side='left') + else: + inputs = hf_tokenizer(prompts, return_tensors='pt', padding=True, truncation=True, padding_side='left') input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask'] hf_generation_config = convert_to_hf(opt_model.generation_config, generation_configs) hf_encoded_outputs = opt_model.generate(input_ids, attention_mask=attention_mask, generation_config=hf_generation_config, tokenizer=hf_tokenizer) diff --git a/tests/python_tests/test_generation_config.py b/tests/python_tests/test_generation_config.py index 72da672713..c204ac7ecf 100644 --- a/tests/python_tests/test_generation_config.py +++ b/tests/python_tests/test_generation_config.py @@ -58,6 +58,8 @@ def verify_set_values(generation_config, kwargs): dict(max_new_tokens=1, assistant_confidence_threshold=0.5), dict(max_new_tokens=1, num_assistant_tokens=2), dict(max_new_tokens=1, num_assistant_tokens=2, max_ngram_size=2), # prompt lookup + dict(max_new_tokens=1, apply_chat_template=True), + dict(max_new_tokens=1, apply_chat_template=False), ] @pytest.mark.parametrize("generation_config_kwargs", configs) @pytest.mark.precommit diff --git a/tests/python_tests/test_llm_pipeline.py b/tests/python_tests/test_llm_pipeline.py index 8968f2a083..276aff7251 100644 --- a/tests/python_tests/test_llm_pipeline.py +++ b/tests/python_tests/test_llm_pipeline.py @@ -26,7 +26,7 @@ test_cases = 
[ (dict(max_new_tokens=20), '你好! 你好嗎?'), - (dict(max_new_tokens=30, num_beams=15, num_beam_groups=3, num_return_sequences=15, diversity_penalty=1.0), 'Alan Turing was a'), + (dict(max_new_tokens=30, num_beams=15, num_beam_groups=3, num_return_sequences=15, diversity_penalty=1.0), 'Why is the Sun yellow?'), ] @pytest.mark.parametrize("generation_config_dict,prompt", test_cases) @pytest.mark.parametrize("model_descr", get_models_list()) @@ -339,7 +339,7 @@ def test_unicode_pybind_decoding_one_string(): # Test that pybind will not fail. model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] - res_str = ov_pipe.generate(',', max_new_tokens=4) + res_str = ov_pipe.generate(',', max_new_tokens=4, apply_chat_template=False) assert '�' == res_str[-1] @@ -350,7 +350,7 @@ def test_unicode_pybind_decoding_batched(): # Test that pybind will not fail. model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] - res_str = ov_pipe.generate([","], max_new_tokens=4) + res_str = ov_pipe.generate([","], max_new_tokens=4, apply_chat_template=False) assert '�' == res_str.texts[0][-1] @@ -362,7 +362,7 @@ def test_unicode_pybind_decoding_one_string_streamer(): model_id, path = 'katuni4ka/tiny-random-phi3', Path('tiny-random-phi3') ov_pipe = read_model((model_id, path))[4] res_str = [] - ov_pipe.generate(",", max_new_tokens=4, streamer=lambda x: res_str.append(x)) + ov_pipe.generate(",", max_new_tokens=4, apply_chat_template=False, streamer=lambda x: res_str.append(x)) assert '�' == ''.join(res_str)[-1] # diff --git a/tests/python_tests/test_sampling.py b/tests/python_tests/test_sampling.py index 7a3aced29a..28b2afd42a 100644 --- a/tests/python_tests/test_sampling.py +++ b/tests/python_tests/test_sampling.py @@ -18,7 +18,7 @@ (dict(max_new_tokens=30, min_new_tokens=30), '你好! 
你好嗎?'), (dict(max_new_tokens=30, ignore_eos=True), 'Alan Turing was a'), # (dict(max_length=40), 'table is made of'), - (dict(stop_token_ids={28998}), 'The Sun is yellow because'), # since a test does not hang, it means stop token is met + (dict(stop_token_ids={28998}, apply_chat_template=False), 'The Sun is yellow because'), # since a test does not hang, it means stop token is met, skip chat template to generate long answer # (dict(max_new_tokens=1, min_new_tokens=0, echo=True), 'What is OpenVINO?') ], ids=["max_new_tokens", @@ -59,7 +59,7 @@ def test_stop_strings(tmp_path, generation_config): @pytest.mark.parametrize("generation_config", [dict(max_new_tokens=30), dict(max_new_tokens=30, repetition_penalty=2.0), - dict(max_new_tokens=300)], + dict(max_new_tokens=300, apply_chat_template=False)], ids=["basic", "repetition_penalty", "long_max_new_tokens"]) @pytest.mark.parametrize("prompt", [ 'What is OpenVINO?', diff --git a/tools/llm_bench/task/text_generation.py b/tools/llm_bench/task/text_generation.py index 76f5678dd9..7b123cc7b3 100644 --- a/tools/llm_bench/task/text_generation.py +++ b/tools/llm_bench/task/text_generation.py @@ -234,6 +234,7 @@ def run_text_generation_genai(input_text, num, model, tokenizer, args, iter_data gen_config.rng_seed = args["seed"] gen_config.num_beams = args["num_beams"] gen_config.do_sample = False + gen_config.apply_chat_template = False if args.get('draft_model', ''): config_info = "Speculative decoding config: " if args.get('num_assistant_tokens', None): @@ -381,6 +382,7 @@ def run_text_generation_genai_with_stream(input_text, num, model, tokenizer, arg gen_config.num_beams = args["num_beams"] gen_config.do_sample = False gen_config.ignore_eos = True + gen_config.apply_chat_template = False enable_prompt_permutations = not args.get("disable_prompt_permutation", False) if enable_prompt_permutations: log.warning( diff --git a/tools/llm_bench/task/visual_language_generation.py b/tools/llm_bench/task/visual_language_generation.py index a02b16b2bb..9cc6702999 100644 --- a/tools/llm_bench/task/visual_language_generation.py +++ b/tools/llm_bench/task/visual_language_generation.py @@ -211,6 +211,7 @@ def run_visual_language_generation_genai( gen_config.max_new_tokens = max_gen_tokens gen_config.num_beams = args["num_beams"] gen_config.do_sample = False + gen_config.apply_chat_template = False kwargs = {} if len(images) >= 1: kwargs["images"] = images[0] diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 1eb778a060..408442a3d9 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -267,7 +267,7 @@ def genai_gen_text(model, tokenizer, question, max_new_tokens, skip_question, us model.finish_chat() return result else: - return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens) + return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=False) def llamacpp_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False): @@ -336,6 +336,7 @@ def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_to config = model.get_generation_config() config.max_new_tokens = max_new_tokens config.do_sample = False + config.apply_chat_template = False model.set_generation_config(config) model.start_chat() From 6c3ecf9c65565505798be444c98dc1514552e6bc Mon Sep 17 00:00:00 2001 From: Vladimir Zlobin Date: Wed, 29 Jan 2025 19:35:52 +0400 Subject: [PATCH 
07/15] beam_search_causal_lm.cpp: delete wrong comment (#1639) --- samples/cpp/text_generation/beam_search_causal_lm.cpp | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/samples/cpp/text_generation/beam_search_causal_lm.cpp b/samples/cpp/text_generation/beam_search_causal_lm.cpp index 9e1ee069ad..2f50100ac6 100644 --- a/samples/cpp/text_generation/beam_search_causal_lm.cpp +++ b/samples/cpp/text_generation/beam_search_causal_lm.cpp @@ -19,9 +19,7 @@ int main(int argc, char* argv[]) try { config.num_beams = 15; config.diversity_penalty = 1.0f; config.num_return_sequences = config.num_beams; - - // Since the streamer is set, the results will - // be printed each time a new token is generated. + auto beams = pipe.generate(prompts, config); std::cout << beams << '\n'; } catch (const std::exception& error) { From ec50b5b68baf2736da7f36e2543ec68667b4e064 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Wed, 29 Jan 2025 20:19:21 +0400 Subject: [PATCH 08/15] [WWB]: Fixed nano-Llava preprocessor selection (#1646) Partially fixes WWB flow for nano-Llava. It works for Optimum inference but requires additional changes on the Optimum side to support HF Transformers. --- tools/who_what_benchmark/whowhatbench/wwb.py | 24 +++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py index 408442a3d9..ce88ef1fab 100644 --- a/tools/who_what_benchmark/whowhatbench/wwb.py +++ b/tools/who_what_benchmark/whowhatbench/wwb.py @@ -4,7 +4,7 @@ import logging import os -from transformers import AutoTokenizer, AutoProcessor +from transformers import AutoTokenizer, AutoProcessor, AutoConfig import openvino as ov import pandas as pd @@ -220,17 +220,19 @@ def load_tokenizer(args): def load_processor(args): - processor = None - if args.base_model is not None: - processor = AutoProcessor.from_pretrained( - args.base_model, trust_remote_code=True - ) - elif args.target_model is not None: - processor = AutoProcessor.from_pretrained( - args.target_model, trust_remote_code=True - ) + model_id = args.base_model if args.base_model is not None else args.target_model + if model_id is None: + return None + + config = AutoConfig.from_pretrained(model_id, trust_remote_code=True) + if "llava-qwen" in config.model_type: + preprocessor_id = config.mm_vision_tower + else: + preprocessor_id = model_id - return processor + return AutoProcessor.from_pretrained( + preprocessor_id, trust_remote_code=True + ) def diff_strings(a: str, b: str, *, use_loguru_colors: bool = False) -> str: From 624eb00c24383c6db5156560b4d9bc80faf1fce1 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Thu, 30 Jan 2025 10:37:54 +0400 Subject: [PATCH 09/15] [WWB]: Added config to preprocessor call in VLMs (#1638) --- tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py b/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py index f0989e9041..6e259bf409 100644 --- a/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py +++ b/tools/who_what_benchmark/whowhatbench/visualtext_evaluator.py @@ -118,7 +118,7 @@ def default_gen_answer( preprocess_inputs = MODEL_TYPE_TO_CLS_MAPPING[ model.config.model_type ].preprocess_inputs - inputs = preprocess_inputs(prompt, image, processor, tokenizer) + inputs = preprocess_inputs(prompt, image, processor, tokenizer, config=model.config) tokens = 
model.generate( **inputs, do_sample=False, From 0c7ce5e0603cc9a98c785f605d369535695c39e5 Mon Sep 17 00:00:00 2001 From: Ilya Lavrenov Date: Thu, 30 Jan 2025 11:18:13 +0400 Subject: [PATCH 10/15] CB: remove DeviceConfig class (#1640) --- .github/labeler.yml | 1 - src/cpp/src/cache_manager.hpp | 81 ++++++++++++++--- src/cpp/src/continuous_batching_impl.cpp | 46 +++++----- src/cpp/src/continuous_batching_impl.hpp | 4 +- src/cpp/src/device_config.hpp | 91 ------------------- .../models/autoencoder_kl.cpp | 2 - .../src/paged_attention_transformations.cpp | 33 ++----- .../src/paged_attention_transformations.hpp | 22 +++-- src/cpp/src/scheduler.hpp | 13 ++- ...batching_for_speculative_decoding_impl.cpp | 4 +- ...batching_for_speculative_decoding_impl.hpp | 2 +- .../speculative_decoding_impl.cpp | 27 +++--- tests/cpp/cache_manager.cpp | 26 ++---- tests/cpp/scheduler.cpp | 4 +- 14 files changed, 145 insertions(+), 211 deletions(-) delete mode 100644 src/cpp/src/device_config.hpp diff --git a/.github/labeler.yml b/.github/labeler.yml index a75abd795c..dbc319c29a 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -99,7 +99,6 @@ - 'src/cpp/src/continuous_batching_impl.cpp' - 'src/cpp/src/continuous_batching_pipeline.cpp' - 'src/cpp/src/debug_utils.hpp' -- 'src/cpp/src/device_config.hpp' - 'src/cpp/src/generation_handle.cpp' - 'src/cpp/src/generation_stream.hpp' - 'src/cpp/src/model_runner.hpp' diff --git a/src/cpp/src/cache_manager.hpp b/src/cpp/src/cache_manager.hpp index 255bb926be..cc89497625 100644 --- a/src/cpp/src/cache_manager.hpp +++ b/src/cpp/src/cache_manager.hpp @@ -5,12 +5,12 @@ #include #include + #include "openvino/runtime/tensor.hpp" -#include "device_config.hpp" +#include "paged_attention_transformations.hpp" #ifndef _WIN32 #include -#include "openvino/core/shape.hpp" class TensorMmapAllocator { @@ -49,11 +49,13 @@ namespace ov::genai { class CacheManager { size_t m_num_decoder_layers = 0; std::string m_device; + size_t m_block_size = 0; // block size is per inference device std::vector m_key_precisions, m_value_precisions; std::vector m_key_shapes, m_value_shapes; std::vector m_key_cache, m_value_cache; size_t m_num_allocated_kv_blocks = 0, m_block_size_in_bytes = 0; ov::InferRequest m_request; + size_t m_k_head_size = 0; static ov::Shape set_kv_blocks(ov::PartialShape pshape, size_t num_kv_blocks) { pshape[0] = num_kv_blocks; @@ -65,47 +67,88 @@ class CacheManager { m_request.set_tensor(std::string("value_cache.") + std::to_string(decoder_layer_id), m_value_cache[decoder_layer_id]); } - ov::PartialShape patch_shape(ov::PartialShape pshape, ov::element::Type cache_type) { + ov::PartialShape to_partial_shape(const KVHeadConfig& config, ov::element::Type cache_type, bool key_param) { OPENVINO_ASSERT(!m_device.empty(), "Internal error: device is not set"); + OPENVINO_ASSERT(m_block_size > 0, "Internal error: block size is not set yet"); + + ov::PartialShape pshape; + + if (m_device.find("CPU") != std::string::npos) { + if (key_param) { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_k_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.k_head_size)}; - if (m_device.find("CPU") != std::string::npos && cache_type == ov::element::u8) { - // Scale, zero point and quantized data will be stored together. 
- // The layout for per token per head: - // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| - // so, we have to extend head_size by 8, which is sizeof(float) - // for scale and sizeof(float) for zeropoint - pshape[3] += 2 * sizeof(float); + if (m_k_head_size == 0) { + m_k_head_size = config.k_head_size; + } + } else { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_v_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.v_head_size)}; + } + + if (cache_type == ov::element::u8) { + // Scale, zero point and quantized data will be stored together. + // The layout for per token per head: + // |scale(f32)|zeropoint(f32)|quantized data(u8,idx_1)|quantized data(u8,idx_2)|...|quantized data(u8,idx_head_size)| + // so, we have to extend head_size by 8, which is sizeof(float) + // for scale and sizeof(float) for zeropoint + pshape[3] += 2 * sizeof(float); + } + } else if (m_device.find("GPU") != std::string::npos) { + if (key_param) { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_k_heads), + ov::Dimension(config.k_head_size), + ov::Dimension(m_block_size)}; + } else { + pshape = ov::PartialShape{ov::Dimension::dynamic(), + ov::Dimension(config.num_v_heads), + ov::Dimension(m_block_size), + ov::Dimension(config.v_head_size)}; + } + } else { + OPENVINO_THROW("Internal error: unsupported device ", m_device); } return pshape; } public: - CacheManager(ov::InferRequest request, const DeviceConfig& device_config) : + CacheManager(ov::InferRequest request, const std::vector& kv_cache_config) : m_request(request) { // extract information about inference device ov::CompiledModel compiled_model = request.get_compiled_model(); std::vector execution_devices = compiled_model.get_property(ov::execution_devices); OPENVINO_ASSERT(execution_devices.size() == 1, "Contituous batching: execution device is expected to be CPU or GPU, but got ", execution_devices.size(), " devices"); m_device = execution_devices[0]; + + // set block_size depending on device + const size_t cpu_block_size = 32, gpu_block_size = 16; + const bool is_gpu = m_device.find("GPU") != std::string::npos; + m_block_size = is_gpu ? 
gpu_block_size : cpu_block_size; // extract information about KV cache precisions and shapes size_t kv_input_index = 0; for (const auto& input : compiled_model.inputs()) { for (auto & name : input.get_names()) { auto cache_precision = input.get_element_type(); + ov::PartialShape pshape; if (name.find("key_cache.") == 0) { - auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision); + pshape = to_partial_shape(kv_cache_config[kv_input_index], cache_precision, true); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); m_key_shapes.push_back(pshape); m_key_precisions.push_back(cache_precision); - m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); break; } else if (name.find("value_cache.") == 0) { - auto pshape = patch_shape(device_config.get_value_cache_shape(kv_input_index), cache_precision); + pshape = to_partial_shape(kv_cache_config[kv_input_index], cache_precision, false); + m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); m_value_shapes.push_back(pshape); m_value_precisions.push_back(cache_precision); - m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size(); ++kv_input_index; break; } @@ -124,6 +167,10 @@ class CacheManager { return m_device; } + size_t get_block_size() const { + return m_block_size; + } + ov::element::Type get_key_cache_precision(size_t decoder_layer_id) const { OPENVINO_ASSERT(decoder_layer_id < m_key_precisions.size()); return m_key_precisions[decoder_layer_id]; @@ -251,6 +298,10 @@ class CacheManager { return m_value_cache[decoder_layer_id]; } + size_t get_v_head_size(size_t layer_id) const { + return m_value_shapes[layer_id][3].get_length(); + } + void copy_blocks(const std::map>& block_copy_map) { for (const auto & blocks_pair : block_copy_map) { size_t src_block_id = blocks_pair.first; diff --git a/src/cpp/src/continuous_batching_impl.cpp b/src/cpp/src/continuous_batching_impl.cpp index b4100f8aec..f95cd3b9c6 100644 --- a/src/cpp/src/continuous_batching_impl.cpp +++ b/src/cpp/src/continuous_batching_impl.cpp @@ -113,14 +113,12 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::ContinuousBatchingImpl( m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - DeviceConfig device_config(device); - bool is_need_per_layer_cache_control = scheduler_config.use_cache_eviction; bool allow_cache_rotation = scheduler_config.cache_eviction_config.apply_rotation; - utils::apply_paged_attention_transformations(model, device_config, is_need_per_layer_cache_control, allow_cache_rotation); + auto kv_cache_config = utils::apply_paged_attention_transformations(model, is_need_per_layer_cache_control, allow_cache_rotation); utils::apply_gather_before_matmul_transformation(model); - initialize_pipeline(model, scheduler_config, properties, device_config); + initialize_pipeline(model, scheduler_config, device, properties, kv_cache_config); } ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() { @@ -139,29 +137,31 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_pull_awaiting_requests void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( std::shared_ptr model, const SchedulerConfig& scheduler_config, + const std::string& device, const ov::AnyMap& properties, 
- const DeviceConfig& device_config) { + const std::vector& kv_cache_config) { ov::Core core = utils::singleton_core(); ov::CompiledModel compiled_model; // TODO: remove once plugin automatically set KV cache precisions - apply_kv_cache_precision(model, device_config.get_device(), properties); + apply_kv_cache_precision(model, device, properties); // apply LoRA if (auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters)) { m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model."); - m_adapter_controller = AdapterController(model, *m_generation_config.adapters, device_config.get_device()); // TODO: Make the prefix name configurable - compiled_model = core.compile_model(model, device_config.get_device(), *filtered_properties); + m_adapter_controller = AdapterController(model, *m_generation_config.adapters, device); // TODO: Make the prefix name configurable + compiled_model = core.compile_model(model, device, *filtered_properties); } else { - compiled_model = core.compile_model(model, device_config.get_device(), properties); + compiled_model = core.compile_model(model, device, properties); } ov::genai::utils::print_compiled_model_properties(compiled_model, "LLM with Paged Attention"); ov::InferRequest infer_request = compiled_model.create_infer_request(); // Cache manager - std::shared_ptr cache_manager = std::make_shared(infer_request, device_config); + std::shared_ptr cache_manager = std::make_shared(infer_request, kv_cache_config); m_num_decoder_layers = cache_manager->get_num_decoder_layers(); + m_block_size = cache_manager->get_block_size(); // Scheduler SchedulerConfig normalized_config = scheduler_config; @@ -171,13 +171,13 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( } bool can_use_partial_preemption = true; - if (device_config.get_device().find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { + if (device.find("GPU") != std::string::npos && !normalized_config.dynamic_split_fuse) { // in case of executing a `vLLM-like` pipeline, it's better not to use partial eviction on the GPU, // as it may lead to performance slowdown can_use_partial_preemption = false; } - m_scheduler = std::make_shared(device_config.get_block_size(), cache_manager, normalized_config, m_num_decoder_layers, can_use_partial_preemption); + m_scheduler = std::make_shared(m_block_size, cache_manager, normalized_config, m_num_decoder_layers, can_use_partial_preemption); // Model Runner bool is_use_cache_eviction = m_scheduler->get_config().use_cache_eviction; @@ -185,7 +185,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( const auto& eviction_config = m_scheduler->get_config().cache_eviction_config; bool is_apply_rotation = eviction_config.apply_rotation; m_model_runner = std::make_shared(infer_request, - m_scheduler->get_block_size(), + m_block_size, m_num_decoder_layers, /* collect_attention_scores = */ true, /* is_use_per_layer_cache_control = */ true, @@ -199,10 +199,10 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( m_rotation_deltas_stores.push_back(store); } - size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_scheduler->get_block_size() + 1; - size_t embedding_size = device_config.get_k_head_size(0); + size_t max_sequence_cache_occupation_length_in_blocks = normalized_config.max_num_batched_tokens / m_block_size + 1; + size_t embedding_size = 
cache_manager->get_v_head_size(0); m_cache_rotation_calculator = std::make_shared( - m_scheduler->get_block_size(), + m_block_size, max_sequence_cache_occupation_length_in_blocks, embedding_size); auto rotation_trig_lut = ov::Tensor(ov::element::f32, ov::Shape{max_sequence_cache_occupation_length_in_blocks, embedding_size}); @@ -224,7 +224,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline( } } else { m_model_runner = - std::make_shared(infer_request, m_scheduler->get_block_size(), m_num_decoder_layers); + std::make_shared(infer_request, m_block_size, m_num_decoder_layers); } m_sampler = std::make_shared(m_tokenizer); @@ -245,9 +245,7 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::add_request(uint64_t request sampling_params.set_eos_token_id(m_generation_config.eos_token_id); sampling_params.validate(); - SequenceGroup::Ptr sequence_group = std::make_shared(request_id, input_ids, - sampling_params, - m_scheduler->get_block_size()); + SequenceGroup::Ptr sequence_group = std::make_shared(request_id, input_ids, sampling_params, m_block_size); if (m_scheduler->get_config().enable_prefix_caching) { m_scheduler->restore_cached_blocks(sequence_group); @@ -662,8 +660,8 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_compute_cache_rotation size_t block_offset = num_blocks_to_rotate_for_each_layer[layer_idx]; auto rotation_deltas_tensor_data = m_rotation_deltas_stores[layer_idx].data() + block_offset; - for (size_t tok_idx = 0; tok_idx < m_scheduler->get_block_size(); tok_idx++) { - rotation_deltas_tensor_data[tok_idx] = block_rotation_data.rotation_delta / m_scheduler->get_block_size(); + for (size_t tok_idx = 0; tok_idx < m_block_size; tok_idx++) { + rotation_deltas_tensor_data[tok_idx] = block_rotation_data.rotation_delta / m_block_size; } num_blocks_to_rotate_for_each_layer[layer_idx] += 1; } @@ -693,7 +691,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_maybe_evict_cache_bloc auto seq_id = seq_id_and_attention_scores.first; const auto& attention_scores_for_all_decoder_layers = seq_id_and_attention_scores.second; if (m_seq_group_id_to_cache_eviction_algo_map.find(seq_id) == m_seq_group_id_to_cache_eviction_algo_map.end()) { - m_seq_group_id_to_cache_eviction_algo_map[seq_id] = CacheEvictionAlgorithm(sched_config.cache_eviction_config, m_scheduler->get_block_size(), num_decoder_layers); + m_seq_group_id_to_cache_eviction_algo_map[seq_id] = CacheEvictionAlgorithm(sched_config.cache_eviction_config, m_block_size, num_decoder_layers); } auto& cache_eviction_algo = m_seq_group_id_to_cache_eviction_algo_map[seq_id]; cache_eviction_algo.register_new_token_scores(attention_scores_for_all_decoder_layers); @@ -728,7 +726,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::_maybe_evict_cache_bloc // Assuming that the evicted blocks are always full (since they by design are only selected from intermediate-age blocks) auto seq_group_ptr = seq_group_ptr_and_num_blocks_evicted.first; auto num_blocks_evicted = seq_group_ptr_and_num_blocks_evicted.second; - seq_group_ptr->register_token_eviction(num_blocks_evicted * m_scheduler->get_block_size()); + seq_group_ptr->register_token_eviction(num_blocks_evicted * m_block_size); } } diff --git a/src/cpp/src/continuous_batching_impl.hpp b/src/cpp/src/continuous_batching_impl.hpp index 9fa6c9c660..1ee40ef73c 100644 --- a/src/cpp/src/continuous_batching_impl.hpp +++ b/src/cpp/src/continuous_batching_impl.hpp @@ -37,6 +37,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public 
ContinuousBatc bool m_is_validation_mode_enabled = false; size_t m_num_decoder_layers = 0; + size_t m_block_size = 0; // Pre-allocated per-layer storages for the per-token cache re-rotation deltas used in cache eviction case std::vector m_rotation_deltas_stores; @@ -58,8 +59,9 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc void initialize_pipeline(std::shared_ptr model, const SchedulerConfig& scheduler_config, + const std::string& device, const ov::AnyMap& plugin_config, - const DeviceConfig& device_config); + const std::vector& kv_cache_config); /** * Pulls requests from awaiting queue to running queue diff --git a/src/cpp/src/device_config.hpp b/src/cpp/src/device_config.hpp deleted file mode 100644 index 09020da9a8..0000000000 --- a/src/cpp/src/device_config.hpp +++ /dev/null @@ -1,91 +0,0 @@ -// Copyright (C) 2023-2025 Intel Corporation -// SPDX-License-Identifier: Apache-2.0 - -#pragma once - -#include "openvino/runtime/core.hpp" -#include "openvino/core/shape.hpp" -#include "openvino/core/type/element_type.hpp" - -#include "openvino/genai/scheduler_config.hpp" - -namespace ov::genai { - -/** - * Per layer KV cache size configuration - */ -struct KVHeadConfig { - size_t num_v_heads, num_k_heads; - size_t v_head_size, k_head_size; -}; - -class DeviceConfig { - std::vector m_key_cache_shape, m_value_cache_shape; - std::vector m_kv_heads_config; - size_t m_block_size = 0; // block size is per inference device - std::string m_device; - - size_t get_block_size_by_device(const std::string& device) const { - const size_t cpu_block_size = 32, gpu_block_size = 16; - const bool is_gpu = device.find("GPU") != std::string::npos; - return is_gpu ? gpu_block_size : cpu_block_size; - } - -public: - explicit DeviceConfig(const std::string& device) { - m_device = device; - m_block_size = get_block_size_by_device(device); - } - - void set_kv_head_configs(const std::vector& kv_heads_config) { - m_kv_heads_config = kv_heads_config; - m_key_cache_shape.reserve(m_kv_heads_config.size()); - m_value_cache_shape.reserve(m_kv_heads_config.size()); - - for (size_t layer_id = 0; layer_id < kv_heads_config.size(); layer_id++) { - const KVHeadConfig& config = m_kv_heads_config[layer_id]; - - m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_v_heads), - ov::Dimension(m_block_size), - ov::Dimension(config.v_head_size)}); - - if (m_device.find("CPU") != std::string::npos) { - m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_k_heads), - ov::Dimension(m_block_size), - ov::Dimension(config.k_head_size)}); - } else if (m_device.find("GPU") != std::string::npos) { - // Update key shape, as the key's shape is different from the value's shape - m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(), - ov::Dimension(config.num_k_heads), - ov::Dimension(config.k_head_size), - ov::Dimension(m_block_size)}); - } - } - } - - std::string get_device() const { - return m_device; - } - - ov::PartialShape get_key_cache_shape(size_t id) const { - OPENVINO_ASSERT(m_key_cache_shape.size()); - return m_key_cache_shape[id]; - } - - ov::PartialShape get_value_cache_shape(size_t id) const { - OPENVINO_ASSERT(m_value_cache_shape.size()); - return m_value_cache_shape[id]; - } - - size_t get_k_head_size(size_t layer_id) const { - return m_kv_heads_config[layer_id].k_head_size; - } - - size_t get_block_size() const { - return m_block_size; - } -}; - -} diff --git 
a/src/cpp/src/image_generation/models/autoencoder_kl.cpp b/src/cpp/src/image_generation/models/autoencoder_kl.cpp index bcec125375..e7357c3f36 100644 --- a/src/cpp/src/image_generation/models/autoencoder_kl.cpp +++ b/src/cpp/src/image_generation/models/autoencoder_kl.cpp @@ -68,8 +68,6 @@ class DiagonalGaussianDistribution { // for BW compatibility with 2024.6.0 ov::AnyMap handle_scale_factor(std::shared_ptr model, const std::string& device, ov::AnyMap properties) { - std::cout << ov::Any(properties).as() << std::endl; - auto it = properties.find("WA_INFERENCE_PRECISION_HINT"); ov::element::Type wa_inference_precision = it != properties.end() ? it->second.as() : ov::element::undefined; if (it != properties.end()) { diff --git a/src/cpp/src/paged_attention_transformations.cpp b/src/cpp/src/paged_attention_transformations.cpp index 6d337136dc..1c7ffd51d2 100644 --- a/src/cpp/src/paged_attention_transformations.cpp +++ b/src/cpp/src/paged_attention_transformations.cpp @@ -10,27 +10,14 @@ namespace ov { namespace genai { namespace utils { -size_t get_hidden_size(const std::shared_ptr model) { - const auto& parameters = model->get_parameters(); - // extract num_kv_heads and head_size - size_t kv_caches_inputs_offset = 2; - ov::PartialShape k_shape = parameters[kv_caches_inputs_offset]->get_partial_shape(); - OPENVINO_ASSERT(k_shape.rank().get_length() == 3, "KV cache shape is expected to have rank 3, while shape is ", k_shape); - size_t num_kv_heads = k_shape[1].get_length(), head_size = k_shape[2].get_length(); - return num_kv_heads * head_size; -} - -void apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control, bool allow_cache_rotation) { +std::vector apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control, bool allow_cache_rotation) { const ov::op::util::VariableVector& variables = model->get_variables(); OPENVINO_ASSERT(!variables.empty(), "Model is supposed to be stateful"); bool use_block_indices_inputs = per_layer_cache_control; bool use_score_outputs = per_layer_cache_control; - ov::pass::SDPAToPagedAttention(use_block_indices_inputs, use_score_outputs, allow_cache_rotation) - .run_on_model(model); -} + ov::pass::SDPAToPagedAttention(use_block_indices_inputs, use_score_outputs, allow_cache_rotation).run_on_model(model); -void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& device_config) { std::map> key_cache_params, value_cache_params; for (const auto& param_ptr : model->get_parameters()) { const auto& name = param_ptr->get_friendly_name(); @@ -44,10 +31,10 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& OPENVINO_ASSERT(key_cache_params.size() == value_cache_params.size() && key_cache_params.size() > 0); size_t num_decoder_layers = key_cache_params.size(); - std::vector kv_heads_config(num_decoder_layers); + std::vector kv_cache_config(num_decoder_layers); for (size_t idx = 0; idx < num_decoder_layers; idx++) { - KVHeadConfig& config = kv_heads_config[idx]; + KVHeadConfig& config = kv_cache_config[idx]; auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto key_shape = k->get_partial_shape(); @@ -60,10 +47,7 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& config.v_head_size = value_shape[2].get_length(); } - // save information about KV caches in device_config - // and create device dependent KV cache shapes - device_config.set_kv_head_configs(kv_heads_config); - + // reset information in KV cache parameters for (size_t idx = 
0; idx < num_decoder_layers; idx++) { auto k = key_cache_params[std::string("key_cache.") + std::to_string(idx)]; auto v = value_cache_params[std::string("value_cache.") + std::to_string(idx)]; @@ -72,17 +56,14 @@ void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& k->set_element_type(ov::element::dynamic); v->set_element_type(ov::element::dynamic); - // set device specific KV cache shapes back to a PA model + // the exact order of dimensions within the shapes is not required by the plugin during compilation k->set_partial_shape(ov::PartialShape::dynamic(4)); v->set_partial_shape(ov::PartialShape::dynamic(4)); } model->validate_nodes_and_infer_types(); -} -void apply_paged_attention_transformations(std::shared_ptr model, DeviceConfig& device_config, bool per_layer_cache_control, bool allow_cache_rotation) { - apply_paged_attention_transformations(model, per_layer_cache_control, allow_cache_rotation); - set_kv_cache_type_and_shape(model, device_config); + return kv_cache_config; } } // namespace utils diff --git a/src/cpp/src/paged_attention_transformations.hpp b/src/cpp/src/paged_attention_transformations.hpp index aa86db2657..66cc6d6bc1 100644 --- a/src/cpp/src/paged_attention_transformations.hpp +++ b/src/cpp/src/paged_attention_transformations.hpp @@ -3,14 +3,23 @@ #pragma once +#include + #include "openvino/core/any.hpp" #include "openvino/core/model.hpp" -#include "device_config.hpp" namespace ov { namespace genai { -namespace utils { +/** + * Per layer KV cache size configuration + */ +struct KVHeadConfig { + size_t num_v_heads, num_k_heads; + size_t v_head_size, k_head_size; +}; + +namespace utils { /** Applies transformations to the ov::Model to enable paged attention inference. * @param model Pointer to the ov::Model representing one of the supported LLM architectures. * @param per_layer_cache_control If true, then the transformations will enable per-layer control of KV cache blocks, allowing to specify * different sets of KV cache blocks for different attention layers. If false, then the KV cache block structure will be identical across all * decoder layers. 
+ * @return Information about each decoder layer configuration */ -void apply_paged_attention_transformations(std::shared_ptr model, DeviceConfig& device_config, bool per_layer_cache_control = false, bool allow_cache_rotation = false); - -void apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control = false, bool allow_cache_rotation = false); - -size_t get_hidden_size(const std::shared_ptr model); - -void set_kv_cache_type_and_shape(std::shared_ptr model, DeviceConfig& device_config); +std::vector apply_paged_attention_transformations(std::shared_ptr model, bool per_layer_cache_control = false, bool allow_cache_rotation = false); void apply_gather_before_matmul_transformation(std::shared_ptr model); diff --git a/src/cpp/src/scheduler.hpp b/src/cpp/src/scheduler.hpp index 23db68deab..160734b520 100644 --- a/src/cpp/src/scheduler.hpp +++ b/src/cpp/src/scheduler.hpp @@ -9,7 +9,6 @@ #include "openvino/runtime/intel_gpu/properties.hpp" #include "openvino/genai/scheduler_config.hpp" -#include "device_config.hpp" #include "block_manager.hpp" #include "sequence_group.hpp" #include "cache_manager.hpp" @@ -45,10 +44,10 @@ class Scheduler { float m_cache_usage = 0.0; }; - explicit Scheduler(size_t block_size, std::shared_ptr cache_manager, const SchedulerConfig & config = {}, size_t num_layers = 1, bool can_use_partial_preemption = true) : - m_cache_manager(cache_manager), - m_can_use_partial_preemption(can_use_partial_preemption), - m_config(config) { + Scheduler(size_t block_size, std::shared_ptr cache_manager, const SchedulerConfig & config = {}, size_t num_layers = 1, bool can_use_partial_preemption = true) : + m_cache_manager(cache_manager), + m_can_use_partial_preemption(can_use_partial_preemption), + m_config(config) { m_block_manager = std::make_shared(m_config.num_kv_blocks, m_config.enable_prefix_caching, block_size, num_layers); OPENVINO_ASSERT(num_layers != 0, "num_layers must be non-zero"); } @@ -499,13 +498,13 @@ class Scheduler { auto seq_length = sequence_groups[idx]->get_prompt_len() * m_kv_blocks_initial_multiplier; auto gen_config = sequence_groups[idx]->get_sampling_parameters(); seq_length = std::min(seq_length, sequence_groups[idx]->get_prompt_len() + sequence_groups[idx]->get_max_new_tokens()); - size_t blocks_num = std::ceil((float)seq_length / m_block_manager->get_block_size()); + size_t blocks_num = std::ceil(static_cast(seq_length) / m_block_manager->get_block_size()); if (gen_config.is_beam_search()) { blocks_num *= gen_config.num_beams; } else if (gen_config.is_multinomial()) { blocks_num *= gen_config.num_return_sequences; } - blocks_sum += blocks_num; + blocks_sum += blocks_num; } m_block_manager->increase_kv_blocks_number(blocks_sum); m_dynamic_memory_allocation = true; diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp index 2ecdbd66f3..14dfaae60f 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp @@ -8,7 +8,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, - const DeviceConfig& device_config, + const std::vector& kv_cache_configs, const SchedulerConfig& scheduler_config, const std::string& device, const ov::AnyMap& 
plugin_config, @@ -16,7 +16,7 @@ ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl::Contin m_tokenizer = tokenizer; m_generation_config = generation_config; m_is_validation_mode_enabled = is_validation_mode_enabled; - initialize_pipeline(model, scheduler_config, plugin_config, device_config); + initialize_pipeline(model, scheduler_config, device, plugin_config, kv_cache_configs); } void diff --git a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp index b714316e75..68cc0e45c4 100644 --- a/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp +++ b/src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp @@ -16,7 +16,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingForSpeculativeDecodingImpl : ContinuousBatchingForSpeculativeDecodingImpl(const std::shared_ptr& model, const Tokenizer& tokenizer, const GenerationConfig& generation_config, - const DeviceConfig& device_config, + const std::vector& kv_cache_configs, const SchedulerConfig& scheduler_config, const std::string& device, const ov::AnyMap& plugin_config, diff --git a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp index 32d13feed1..51490945e7 100644 --- a/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp +++ b/src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp @@ -33,8 +33,8 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con auto main_scheduler_config = main_model_desc.scheduler_config; auto main_device = main_model_desc.device; - utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); - utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); + auto main_kv_cache_config = utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction); + auto draft_kv_cache_config = utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction); utils::apply_gather_before_matmul_transformation(main_model); utils::apply_gather_before_matmul_transformation(draft_model); @@ -47,10 +47,18 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con if (is_draft_scheduler_undefined) { // split KV cache to 2 caches for main and draft models - size_t main_model_hidden_size = utils::get_hidden_size(main_model), - draft_model_hidden_size = utils::get_hidden_size(draft_model); - auto k = static_cast(draft_model_hidden_size) / (main_model_hidden_size + draft_model_hidden_size); + auto compute_total_hidden_size = [] (const std::vector& kv_cache_config) -> size_t { + size_t total_hidden_size = 0; + for (auto & config : kv_cache_config) { + total_hidden_size += config.k_head_size * config.num_k_heads + config.v_head_size * config.num_v_heads; + } + return total_hidden_size; + }; + float main_model_hidden_size = compute_total_hidden_size(main_kv_cache_config), + draft_model_hidden_size = compute_total_hidden_size(draft_kv_cache_config); + auto k = draft_model_hidden_size / (main_model_hidden_size + draft_model_hidden_size); + // TODO: work with KV blocks as it will be more precise instead of GBs size_t main_cache_size = std::ceil(main_scheduler_config.cache_size * (1.f - k)), 
draft_cache_size = main_scheduler_config.cache_size - main_cache_size; if (draft_cache_size == 0 && main_cache_size > 0) { @@ -64,11 +72,6 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con ov::AnyMap draft_properties = draft_model_desc.properties.empty() ? main_model_desc.properties : draft_model_desc.properties; - DeviceConfig main_device_config(main_device), draft_device_config(draft_device); - - utils::set_kv_cache_type_and_shape(main_model, main_device_config); - utils::set_kv_cache_type_and_shape(draft_model, draft_device_config); - // main and draft model can have different tokenizers // to do: support retokenization: 154103 Tokenizer main_model_tokenizer = main_model_desc.tokenizer; @@ -82,10 +85,10 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con // to create `main_pipeline` with enabled validation_mode and `draft_pipeline` with disabled validation mode m_main_pipeline = std::make_shared( main_model, main_model_tokenizer, main_model_desc.generation_config, - main_device_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); + main_kv_cache_config, main_scheduler_config_updated, main_device, main_model_desc.properties, true); m_draft_pipeline = std::make_shared( draft_model, draft_model_tokenizer, draft_model_desc.generation_config, - draft_device_config, draft_scheduler_config, draft_device, draft_properties, false); + draft_kv_cache_config, draft_scheduler_config, draft_device, draft_properties, false); } GenerationHandle diff --git a/tests/cpp/cache_manager.cpp b/tests/cpp/cache_manager.cpp index 864a7b43af..986b342ca7 100644 --- a/tests/cpp/cache_manager.cpp +++ b/tests/cpp/cache_manager.cpp @@ -5,7 +5,6 @@ #include #include "openvino/runtime/core.hpp" #include "scheduler.hpp" -#include "device_config.hpp" #include "cache_manager.hpp" #include "helper.hpp" @@ -35,17 +34,15 @@ TEST(TestCacheManager, test_cache_size_param) { scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; - const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); + const std::vector kv_cache_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(request, device_config); + auto cache_manager = std::make_shared(request, kv_cache_config); ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, cache_manager->get_block_size_in_bytes()); - auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + auto block_manager = BlockManager(num_kv_blocks, false, cache_manager->get_block_size(), cache_manager->get_num_decoder_layers()); cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); const size_t kv_cache_total_size = scheduler_config.cache_size * 1024 * 1024 * 1024; @@ -63,13 +60,10 @@ TEST(TestCacheManager, test_kv_blocks_param) { scheduler_config.cache_size = 0; scheduler_config.max_num_seqs = 2; - const std::string device = "CPU"; - DeviceConfig device_config("CPU"); + const size_t cpu_block_size = 32; const size_t num_decoder_layers = 12; - const std::vector 
kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); - auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, device_config.get_block_size(), num_decoder_layers); + auto block_manager = BlockManager(scheduler_config.num_kv_blocks, false, cpu_block_size, num_decoder_layers); ASSERT_EQ(block_manager.get_total_number_of_kv_blocks(), scheduler_config.num_kv_blocks); } @@ -83,17 +77,15 @@ TEST(TestCacheManager, test_dynamic_cache_increase) { scheduler_config.max_num_seqs = 2; const std::string device = "CPU"; - DeviceConfig device_config("CPU"); const size_t num_decoder_layers = 12; - const std::vector kv_heads_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); - device_config.set_kv_head_configs(kv_heads_config); + const std::vector kv_cache_config(num_decoder_layers, KVHeadConfig { 12, 12, 64, 64 }); ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); - auto cache_manager = std::make_shared(request, device_config); + auto cache_manager = std::make_shared(request, kv_cache_config); size_t block_size_in_bytes = cache_manager->get_block_size_in_bytes(); const size_t num_kv_blocks = get_num_kv_blocks(scheduler_config.cache_size, block_size_in_bytes); - auto block_manager = BlockManager(num_kv_blocks, false, device_config.get_block_size(), cache_manager->get_num_decoder_layers()); + auto block_manager = BlockManager(num_kv_blocks, false, cache_manager->get_block_size(), cache_manager->get_num_decoder_layers()); ASSERT_EQ(num_decoder_layers, cache_manager->get_num_decoder_layers()); // check initial cache allocation @@ -115,4 +107,4 @@ TEST(TestCacheManager, test_dynamic_cache_increase) { // check that cache does not increase if new blocks were not allocated cache_manager->allocate_cache_if_needed(block_manager.get_total_number_of_kv_blocks()); ASSERT_EQ(get_total_allocated_bytes(cache_manager), 200 * block_size_in_bytes); -} \ No newline at end of file +} diff --git a/tests/cpp/scheduler.cpp b/tests/cpp/scheduler.cpp index b6aa5a9b53..1e147203f4 100644 --- a/tests/cpp/scheduler.cpp +++ b/tests/cpp/scheduler.cpp @@ -26,9 +26,7 @@ std::shared_ptr init_cache_manager(SchedulerConfig scheduler_confi ov::InferRequest request = core.compile_model(get_dummy_model(core, num_decoder_layers)).create_infer_request(); const size_t head_size = 64; std::vector kv_head_configs(num_decoder_layers, KVHeadConfig { 12, 12, head_size, head_size }); - ov::genai::DeviceConfig device_config("CPU"); - device_config.set_kv_head_configs(kv_head_configs); - return std::make_shared(request, device_config); + return std::make_shared(request, kv_head_configs); } TEST(TestScheduler, general_test) { From 97bb83ad20317cef9b937f345c67252fbd9808e8 Mon Sep 17 00:00:00 2001 From: Alexander Kozlov Date: Thu, 30 Jan 2025 12:05:58 +0400 Subject: [PATCH 11/15] [WWB]: Added initialization of nano-llava in case of Transformers model (#1649) --- tools/who_what_benchmark/whowhatbench/model_loaders.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/who_what_benchmark/whowhatbench/model_loaders.py b/tools/who_what_benchmark/whowhatbench/model_loaders.py index c792a3c0b2..8ab73483b3 100644 --- a/tools/who_what_benchmark/whowhatbench/model_loaders.py +++ b/tools/who_what_benchmark/whowhatbench/model_loaders.py @@ -173,6 +173,10 @@ def load_visual_text_model( model_id, trust_remote_code=True, device_map=device.lower(), _attn_implementation="eager", use_flash_attention_2=False ) 
         model.eval()
+        try:
+            model.get_vision_tower().load_model()
+        except Exception:
+            pass
     elif use_genai:
         logger.info("Using OpenVINO GenAI API")
         model = load_visual_text_genai_pipeline(model_id, device, ov_config)

From debf2c60735e9dbae5c338b67f0dd5031ee751d8 Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 16:21:20 +0400
Subject: [PATCH 12/15] WWB: simplify code around start_chat / use_template (#1650)

See https://github.com/openvinotoolkit/openvino.genai/pull/1533

Co-authored-by: Alexander Kozlov
---
 tools/who_what_benchmark/whowhatbench/wwb.py | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/tools/who_what_benchmark/whowhatbench/wwb.py b/tools/who_what_benchmark/whowhatbench/wwb.py
index ce88ef1fab..2008f6aba4 100644
--- a/tools/who_what_benchmark/whowhatbench/wwb.py
+++ b/tools/who_what_benchmark/whowhatbench/wwb.py
@@ -263,13 +263,7 @@ def diff_strings(a: str, b: str, *, use_loguru_colors: bool = False) -> str:
 
 
 def genai_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False):
-    if use_chat_template:
-        model.start_chat()
-        result = model.generate(question, do_sample=False, max_new_tokens=max_new_tokens)
-        model.finish_chat()
-        return result
-    else:
-        return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=False)
+    return model.generate(question, do_sample=False, max_new_tokens=max_new_tokens, apply_chat_template=use_chat_template)
 
 
 def llamacpp_gen_text(model, tokenizer, question, max_new_tokens, skip_question, use_chat_template=False):
@@ -335,15 +329,7 @@ def genai_gen_inpainting(model, prompt, image, mask, num_inference_steps, genera
 
 def genai_gen_visual_text(model, prompt, image, processor, tokenizer, max_new_tokens, crop_question):
     image_data = ov.Tensor(np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8))
-    config = model.get_generation_config()
-    config.max_new_tokens = max_new_tokens
-    config.do_sample = False
-    config.apply_chat_template = False
-    model.set_generation_config(config)
-
-    model.start_chat()
-    out = model.generate(prompt, image=image_data)
-    model.finish_chat()
+    out = model.generate(prompt, image=image_data, do_sample=False, max_new_tokens=max_new_tokens)
     return out.texts[0]
 
 

From 0efa8a5b3bf7ffb06caa810c80ead149673ca5ab Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 16:42:15 +0400
Subject: [PATCH 13/15] Tokenizers update (#1653)

To pick up https://github.com/openvinotoolkit/openvino_tokenizers/pull/391
---
 thirdparty/openvino_tokenizers | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/thirdparty/openvino_tokenizers b/thirdparty/openvino_tokenizers
index 09c7005e0d..83dda59a58 160000
--- a/thirdparty/openvino_tokenizers
+++ b/thirdparty/openvino_tokenizers
@@ -1 +1 @@
-Subproject commit 09c7005e0da46a50cc86b0e6e4ac9b8663a7af70
+Subproject commit 83dda59a5820334ce833f7a326bfb2911fc2576a

From 40cb8498fec15be8f7e8b1475a972501235ea794 Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Thu, 30 Jan 2025 17:54:19 +0400
Subject: [PATCH 14/15] DOCS: reorganized support models for image generation (#1655)

---
 SUPPORTED_MODELS.md | 80 ++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 45 deletions(-)

diff --git a/SUPPORTED_MODELS.md b/SUPPORTED_MODELS.md
index c5c55b8d73..d8e9dbe191 100644
--- a/SUPPORTED_MODELS.md
+++ b/SUPPORTED_MODELS.md
@@ -166,6 +166,7 @@ The pipeline can work with other similar topologies produced by `optimum-intel`
       <th>Architecture</th>
       <th>Text 2 image</th>
       <th>Image 2 image</th>
+      <th>Inpainting</th>
       <th>LoRA support</th>
       <th>Example HuggingFace Models</th>
@@ -174,6 +175,7 @@ The pipeline can work with other similar topologies produced by `optimum-intel`
       <td>Supported</td>
       <td>Supported</td>
       <td>Supported</td>
+      <td>Supported</td>