[Feature Branch] KV Cache Interface #1083

Merged Jul 12, 2023 · 109 commits
Changes from 16 commits

Commits
48ac0ac
initial commit
dbogunowicz Jun 5, 2023
cf7f2b9
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
832630a
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 6, 2023
9958c83
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 7, 2023
e6d2b03
limit to 150mb
dbogunowicz Jun 7, 2023
7f9935b
ready to review
dbogunowicz Jun 7, 2023
b1cf01b
initial commit
dbogunowicz Mar 2, 2023
0a3f48d
[Codegen][ORT][Static Seq Length] TextGenerationPipeline (#946)
dbogunowicz Mar 16, 2023
add4625
[CodeGen][Documentation] (#956)
dbogunowicz Mar 23, 2023
22d2746
reimplementation for generative pipelines
markurtz May 8, 2023
7f1651d
restore text generation from examples
dbogunowicz May 8, 2023
b85746d
[CodeGen] ONNX model loading to support >2Gb models / two engines (#991)
dbogunowicz May 8, 2023
aadc608
refactor sucessfull
dbogunowicz May 10, 2023
58bc2b0
Pipeline fully refactored, time to test engine support. Note: Sliding…
dbogunowicz May 11, 2023
d538444
First iteration with Sage
dbogunowicz May 11, 2023
e19676b
Apply suggestions from code review
dbogunowicz May 11, 2023
7908b74
ORT agrees with the Engine. But they both give not entirely correct r…
dbogunowicz May 11, 2023
4bc3472
dynamic ORT vs static DS
dbogunowicz May 12, 2023
c07f7ed
pipeline handles OPT multitoken pass
dbogunowicz May 16, 2023
fb77838
fixes to get static pipeline a little further along
May 16, 2023
2097463
adjust shapes and slicing to enable static autoregressive pass - ISSU…
May 17, 2023
5eb10a9
migrate from cache_length to positions input
May 18, 2023
9213f29
got if working for multitoken + single token scenario
dbogunowicz May 18, 2023
d9af004
cleanup the pipeline
dbogunowicz May 19, 2023
476f25d
further cleanup post merge
dbogunowicz May 19, 2023
fab44e4
Pipeline working for single-token inference only
dbogunowicz May 19, 2023
d454e2f
do not load the onnx model with external files twice
dbogunowicz May 19, 2023
1613e25
pipeline never redundantly saves the external data + more robust toke…
dbogunowicz May 19, 2023
b61055c
Stop saving tmp files, otherwise the engine looks for external files …
dbogunowicz May 19, 2023
6ee25fc
Left pad support
May 19, 2023
5d3004b
cleanup
dbogunowicz May 22, 2023
ace6fa5
cleanup2
dbogunowicz May 22, 2023
388586d
Add in pipeline timing
markurtz May 24, 2023
afd0139
add in force tokens logic
markurtz May 24, 2023
30eeda7
remove input validation for text generation pipelines
markurtz May 24, 2023
5882b56
remove multitoken support for now
markurtz May 24, 2023
4bbe33d
remove kv cache engine and other fixes
markurtz May 25, 2023
afa5746
nest input shape override
markurtz May 25, 2023
e2bb78c
comment out input shape override
markurtz May 25, 2023
2299009
add non batch override for ORT
markurtz May 25, 2023
2935b77
clean up generation pipeline
markurtz Jun 9, 2023
b89b156
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 11, 2023
dc3d61b
initial commit
dbogunowicz Jun 5, 2023
a294265
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
af97f2b
limit to 150mb
dbogunowicz Jun 7, 2023
c117788
ready to review
dbogunowicz Jun 7, 2023
4ad5f49
fix the erronous Makefile
dbogunowicz Jun 13, 2023
9e816bb
Merge branch 'feature/damian/do_not_save_to_tmp' of https://github.co…
dbogunowicz Jun 13, 2023
f97467f
perhaps fixed GHA
dbogunowicz Jun 13, 2023
6be8d87
take into consideration that GHA creates four files
dbogunowicz Jun 13, 2023
e2f088d
initial commit
dbogunowicz Jun 13, 2023
9fc6c64
Merge remote-tracking branch 'origin/feature/damian/do_not_save_to_tm…
dbogunowicz Jun 13, 2023
a610faf
tested with actual model
dbogunowicz Jun 13, 2023
347d1fb
remove val_inp argument
dbogunowicz Jun 13, 2023
e11027c
Update README.md
dbogunowicz Jun 13, 2023
a950910
Apply suggestions from code review
dbogunowicz Jun 13, 2023
c1d02dc
Update README.md
dbogunowicz Jun 13, 2023
711cdfb
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 13, 2023
e602662
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 14, 2023
2085c37
[BugFix] Update deepsparse dockerfile (#1069)
rahul-tuli Jun 14, 2023
2f7bc95
initial implementation
dbogunowicz Jun 15, 2023
e18fab7
working implementation for pipeline input
dbogunowicz Jun 16, 2023
0358d87
[Fix] Fix CLI benchmark errors (#1071)
dbogunowicz Jun 15, 2023
06b5246
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 16, 2023
2cab681
Merge branch 'feature/damian/codegen_pipeline_clean' into feature/dam…
dbogunowicz Jun 16, 2023
63b116b
Clean a typo in the pipeline code
dbogunowicz Jun 16, 2023
cde08b9
initial commit
dbogunowicz Jun 21, 2023
99d125c
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 22, 2023
67ffe47
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 26, 2023
9937686
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 27, 2023
0d6a423
[KV Cache Interface] DecoderKVCache (#1084)
dbogunowicz Jun 28, 2023
0809aea
[WiP] [KV Cache Interface] Text Generation & Decoder Engine Implement…
dbogunowicz Jun 28, 2023
7001a6e
working implementation, time to cleanup
dbogunowicz Jun 29, 2023
c1bf5b7
now kv cache decoder holds information about the num of tokens prepro…
dbogunowicz Jun 29, 2023
79251e6
cleanup the old files
dbogunowicz Jun 29, 2023
9efbdb6
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jun 29, 2023
da5e93e
ready for review
dbogunowicz Jun 29, 2023
a680dac
ready for testing
dbogunowicz Jun 29, 2023
7099994
managed to get first logits right
dbogunowicz Jun 29, 2023
1d4d96d
Delete example
dbogunowicz Jun 29, 2023
08e5421
cleanup before sharing with Ben and Sage
dbogunowicz Jun 29, 2023
bfaa072
Merge branch 'feature/damian/pipeline_engine_support' of https://gith…
dbogunowicz Jun 29, 2023
fbeeb4a
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jun 29, 2023
f83dcab
assert proper padding on pipeline init
dbogunowicz Jul 3, 2023
e659c33
now also supporting kv cache perplexity. time for cleanup
dbogunowicz Jul 3, 2023
cf74ad7
ready for review
dbogunowicz Jul 3, 2023
853f876
correctly print engine info
dbogunowicz Jul 3, 2023
e8da07e
work with left padding of the tokenizer
dbogunowicz Jul 3, 2023
58b12c8
quality
dbogunowicz Jul 3, 2023
eecd232
fix the multitoken inference
dbogunowicz Jul 5, 2023
10c804a
Perplexity Eval for Text Generation Models (#1073)
dbogunowicz Jul 5, 2023
7bd23d6
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 5, 2023
10ba82e
[Text Generation] Run deepsparse engine without the LIB.kv_cache obje…
dbogunowicz Jul 7, 2023
e81c327
added few improvements that turned out to be useful post manual testing
dbogunowicz Jul 7, 2023
b737f77
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jul 7, 2023
042cb79
fixed the logic to assert correct multibatch inference
dbogunowicz Jul 7, 2023
bf4eac3
Merge branch 'feature/damian/fb_kv_cache' of https://github.com/neura…
dbogunowicz Jul 7, 2023
c8a1f93
fix integration tests
dbogunowicz Jul 7, 2023
d2d3dc1
initial implementation
dbogunowicz Jul 10, 2023
6ce1ca4
perplexity working, so as batched inference for different sized inputs
dbogunowicz Jul 10, 2023
47dc986
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 10, 2023
ef77d91
fix the integration test
dbogunowicz Jul 10, 2023
f0d74b0
Merge branch 'feature/damian/fb_kv_cache' of https://github.com/neura…
dbogunowicz Jul 10, 2023
186c80c
better solution for fixing the issues caused by this PR in GHA
dbogunowicz Jul 10, 2023
09993e7
revert changes to yolo pipeline
dbogunowicz Jul 10, 2023
ba8c126
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 11, 2023
37e8a02
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jul 11, 2023
0d308b9
response to Rahuls comments
dbogunowicz Jul 11, 2023
41e9306
Merge remote-tracking branch 'origin/main' into feature/damian/fb_kv_…
dbogunowicz Jul 12, 2023
48 changes: 0 additions & 48 deletions src/deepsparse/engine.py
@@ -28,7 +28,6 @@
from deepsparse.benchmark import BenchmarkResults
from deepsparse.utils import (
generate_random_inputs,
get_output_names,
model_to_path,
override_onnx_input_shapes,
)
@@ -54,7 +53,6 @@
"Scheduler",
"Context",
"MultiModelEngine",
"KVCacheEngine",
"BaseEngine",
]

@@ -845,52 +843,6 @@ def __init__(
)


class KVCacheEngine(Engine):
"""
Engine that can do kv caching.
"""

def __init__(
self,
model: Union[str, "Model", "File"],
batch_size: int = 1,
num_cores: int = None,
num_streams: int = None,
scheduler: Scheduler = None,
input_shapes: List[List[int]] = None,
kv_cache_bools: List[bool] = None,
prev_cache_length: int = 0,
):
BaseEngine.construct(
self, model, batch_size, num_cores, num_streams, scheduler, input_shapes
)

if kv_cache_bools is None:
# If no list was provided, then we assume all outputs except for the first are KV caches
# Note: In the future we can look at the names of outputs to be more sure
#
# Create a boolean list of every output of the model
output_names = get_output_names(self._model_path)
kv_cache_bools = [True for i in range(len(output_names))]
# Assume first input is logits and logits ought not to be cached
kv_cache_bools[0] = False

num_streams = _validate_num_streams(num_streams, self._num_cores)
if self._input_shapes:
raise NotImplementedError("Don't do this yet :)")
else:
self._eng_net = LIB.deepsparse_engine(
self._model_path,
self._batch_size,
self._num_cores,
num_streams,
self._scheduler.value,
None,
kv_cache_bools,
prev_cache_length,
)


def compile_model(
model: Union[str, "Model", "File"],
batch_size: int = 1,
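For context on the deletion above: the removed `KVCacheEngine` built its default `kv_cache_bools` by flagging every model output as a KV-cache tensor except the first, which it assumed to be the logits. A minimal sketch of that default (the helper name here is hypothetical, not part of the deepsparse API):

```python
def build_kv_cache_bools(output_names):
    # Mark every output as a KV-cache tensor by default ...
    flags = [True] * len(output_names)
    if flags:
        # ... except the first output, assumed to be the logits,
        # which ought not to be cached
        flags[0] = False
    return flags


print(build_kv_cache_bools(["logits", "past_key_0", "past_value_0"]))
# [False, True, True]
```

As the original comment notes, inspecting output names (rather than relying on position) would make this assumption more robust.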
56 changes: 35 additions & 21 deletions src/deepsparse/pipeline.py
@@ -53,6 +53,7 @@
"yolo_pipeline",
"Bucketable",
"BucketingPipeline",
"create_engine",
]

DEEPSPARSE_ENGINE = "deepsparse"
@@ -157,6 +158,7 @@ def __init__(
logger: Optional[Union[BaseLogger, str]] = None,
benchmark: bool = False,
_delay_engine_initialize: bool = False, # internal use only
_delay_overwriting_inputs: bool = False, # internal use only
):
self._benchmark = benchmark
self._model_path_orig = model_path
@@ -200,7 +202,7 @@ def __init__(
if engine_type.lower() == DEEPSPARSE_ENGINE:
self._engine_args["scheduler"] = scheduler

self.onnx_file_path = self.setup_onnx_file_path()
self.onnx_file_path = self.setup_onnx_file_path(_delay_overwriting_inputs)

if _delay_engine_initialize:
self.engine = None
@@ -810,26 +812,10 @@ def log_inference_times(self, timer: StagedTimer):
category=MetricCategories.SYSTEM,
)

def _initialize_engine(self) -> Union[Engine, ORTEngine]:
engine_type = self.engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if self.context is not None and isinstance(self.context, Context):
self._engine_args.pop("num_cores", None)
self._engine_args.pop("scheduler", None)
self._engine_args["context"] = self.context
return MultiModelEngine(
model=self.onnx_file_path,
**self._engine_args,
)
return Engine(self.onnx_file_path, **self._engine_args)
elif engine_type == ORT_ENGINE:
return ORTEngine(self.onnx_file_path, **self._engine_args)
else:
raise ValueError(
f"Unknown engine_type {self.engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)
def _initialize_engine(self) -> Union[Engine, MultiModelEngine, ORTEngine]:
return create_engine(
self.onnx_file_path, self.engine_type, self._engine_args, self.context
)

def _identifier(self):
# get pipeline identifier; used in the context of logging
@@ -1007,6 +993,34 @@ def route_input_to_bucket(
pass


def create_engine(
onnx_file_path: str,
engine_type: str,
engine_args: Dict,
context: Optional[Context] = None,
) -> Union[Engine, MultiModelEngine, ORTEngine]:
engine_type = engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if context is not None and isinstance(context, Context):
engine_args.pop("num_cores", None)
engine_args.pop("scheduler", None)
engine_args["context"] = context
return MultiModelEngine(
model=onnx_file_path,
**engine_args,
)
return Engine(onnx_file_path, **engine_args)

if engine_type == ORT_ENGINE:
return ORTEngine(onnx_file_path, **engine_args)

raise ValueError(
f"Unknown engine_type {engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)


def _initialize_executor_and_workers(
batch_size: Optional[int],
workers_or_executor: Optional[Union[int, ThreadPoolExecutor]],
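The new `create_engine` helper factors the engine dispatch out of `Pipeline._initialize_engine` so the same logic can be reused outside the class. A self-contained sketch of that dispatch, with stub classes standing in for the real deepsparse `Engine`, `MultiModelEngine`, `ORTEngine`, and `Context`:

```python
DEEPSPARSE_ENGINE = "deepsparse"
ORT_ENGINE = "onnxruntime"
SUPPORTED_PIPELINE_ENGINES = [DEEPSPARSE_ENGINE, ORT_ENGINE]


class Engine:
    def __init__(self, model, **kwargs):
        self.model = model


class MultiModelEngine(Engine):
    pass


class ORTEngine(Engine):
    pass


class Context:
    pass


def create_engine(onnx_file_path, engine_type, engine_args, context=None):
    engine_type = engine_type.lower()
    if engine_type == DEEPSPARSE_ENGINE:
        if context is not None and isinstance(context, Context):
            # a Context manages cores/scheduling itself, so drop
            # those args and hand the context to a MultiModelEngine
            engine_args.pop("num_cores", None)
            engine_args.pop("scheduler", None)
            engine_args["context"] = context
            return MultiModelEngine(model=onnx_file_path, **engine_args)
        return Engine(onnx_file_path, **engine_args)
    if engine_type == ORT_ENGINE:
        return ORTEngine(onnx_file_path, **engine_args)
    raise ValueError(
        f"Unknown engine_type {engine_type}. Supported values include: "
        f"{SUPPORTED_PIPELINE_ENGINES}"
    )


print(type(create_engine("model.onnx", "deepsparse", {})).__name__)
# Engine
```

With this factored out, `_initialize_engine` reduces to a one-line call, as the diff shows.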
23 changes: 23 additions & 0 deletions src/deepsparse/tasks.py
@@ -95,6 +95,12 @@ class SupportedTasks:
),
)

text_generation = namedtuple("text_generation", ["opt", "codegen", "bloom"])(
codegen=AliasedTask("codegen", []),
opt=AliasedTask("opt", []),
bloom=AliasedTask("bloom", []),
)

image_classification = namedtuple("image_classification", ["image_classification"])(
image_classification=AliasedTask(
"image_classification",
@@ -150,6 +156,9 @@ def check_register_task(
# custom task, register the CustomPipeline
import deepsparse.pipelines.custom_pipeline # noqa: F401

elif cls.is_text_generation(task):
import deepsparse.transformers.pipelines.text_generation # noqa: F401

elif cls.is_nlp(task):
# trigger transformers pipelines to register with Pipeline.register
import deepsparse.transformers.pipelines # noqa: F401
@@ -193,6 +202,20 @@ def check_register_task(
f"{list(all_tasks)}"
)

@classmethod
def is_text_generation(cls, task: str) -> bool:
"""
:param task: the name of the task to check whether it is a text generation task
such as codegen
:return: True if it is a text generation task, False otherwise
"""
return any(
[
text_generation_task.matches(task)
for text_generation_task in cls.text_generation
]
)

@classmethod
def is_nlp(cls, task: str) -> bool:
"""
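The new `is_text_generation` classmethod relies on `AliasedTask.matches` to route task names such as `opt`, `codegen`, and `bloom` to the text-generation pipeline import. A minimal stub approximating that matching (the normalization shown is an assumption; the real `AliasedTask` may handle aliases and name variants differently):

```python
class AliasedTask:
    def __init__(self, name, aliases):
        self._name = name
        self._aliases = aliases

    def matches(self, task):
        # normalize the incoming name, then compare against the
        # canonical name and any registered aliases
        task = task.lower().replace("-", "_")
        return task == self._name or task in self._aliases


TEXT_GENERATION_TASKS = [
    AliasedTask("codegen", []),
    AliasedTask("opt", []),
    AliasedTask("bloom", []),
]


def is_text_generation(task):
    return any(t.matches(task) for t in TEXT_GENERATION_TASKS)


print(is_text_generation("opt"))   # True
print(is_text_generation("bert"))  # False
```

Note the `any([...])` in the diff builds a full list before checking; a bare generator expression, as sketched here, short-circuits on the first match.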
15 changes: 15 additions & 0 deletions src/deepsparse/transformers/engines/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# flake8: noqa
from .nl_decoder_engine import *