[Feature Branch] KV Cache Interface #1083
Merged
Conversation
* initial commit
* coreys simplifications
* finishing the second model static
* ready, time for beautification
* ready for review
* moved the code to examples
* fix eos logic
* add argument num_tokens_to_generate
* initial commit
* change order
* Update examples/codegen/README.md (Co-authored-by: corey-nm <109536191+corey-nm@users.noreply.github.com>)
* … window not yet implemented!
* …esult. Hey, this is good news still
* …E: tokens past the base seq len are repeated
* …in the wrong place
dbogunowicz commented Jul 7, 2023
* …lmagic/deepsparse into feature/damian/fb_kv_cache
bfineran previously approved these changes Jul 10, 2023
LGTM - we 100% need a bit more testing, let's make a plan for that. Let's also include the deepsparse vs ort perplexities in the description
rahul-tuli previously approved these changes Jul 11, 2023
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
bfineran previously approved these changes Jul 11, 2023
rahul-tuli previously approved these changes Jul 12, 2023
bfineran approved these changes Jul 12, 2023
Feature Preview
Feature branch that aggregates all the features constituting the KV Cache Interface implementation. This includes:
* No-cache inference
* Single-token engine decoding only:
```
2023-06-27 07:55:20 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:55:24 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:37 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:40 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx

['\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government.\n\nThe president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive']
```
* Single-token engine and multi-token engine decoding:
```
2023-06-27 07:57:53 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:47 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:52 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:58 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx

['Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is']
```
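For reference, generations like the ones above can be produced through a text-generation pipeline. A minimal sketch; the task name, argument names, and model path are assumptions taken from the logs above, not the exact API of this branch:

```python
from deepsparse import Pipeline

# Sketch only: task name, model path, and argument names are assumptions
# based on the logs above, not the exact API of this branch.
pipeline = Pipeline.create(
    task="text_generation",
    model_path="/home/ubuntu/damian/sparseml/deployment",  # directory containing model.onnx
)
output = pipeline(sequences="Who is the president of the United States?")
print(output)
```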
Testing Scope
Manual Tests
The script below terminates without raising an error
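A minimal stand-in for such a script, under the same Pipeline assumptions as the sketch above (model path and prompt are placeholders):

```python
from deepsparse import Pipeline

# Smoke test: the check passes if generation completes without raising.
pipeline = Pipeline.create(task="text_generation", model_path="deployment/")
output = pipeline(sequences="def fibonacci(n):")
assert output is not None
print("manual test passed")
```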
Testing with `eval_downstream`
HF baseline:
Result with kv cache model
Result with non-kv cache model
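For reference, an HF baseline perplexity can be computed with the standard transformers API; the model name and evaluation text below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hf_perplexity(model_name: str, text: str) -> float:
    """Perplexity of `text` under a Hugging Face causal LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Labels are shifted internally; loss is the mean token NLL.
        loss = model(input_ids, labels=input_ids).loss
    return float(torch.exp(loss))

print(hf_perplexity("gpt2", "The president of the United States is the head of state."))
```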
Current Limitations
We are currently unable to use the internal `LIB.kv_cache` object for cache manipulation. We are also unable to run multi-token inference in the engine due to an issue with "zero-length" cache ingestion: in deepsparse engine inference, whenever a sequence would normally be processed by the multi-token engine, the single-token engine takes over instead.
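A hypothetical sketch of that fallback; the engine objects and names here are illustrative, not the actual deepsparse internals:

```python
def decode(tokens, single_token_engine, multi_token_engine):
    """Run decoding, falling back to the single-token engine."""
    # Normally a long prompt would be ingested in chunks by the multi-token
    # engine. Because "zero-length" cache ingestion is not yet supported,
    # the single-token engine processes every position instead.
    MULTI_TOKEN_SUPPORTED = False  # current limitation

    if MULTI_TOKEN_SUPPORTED and len(tokens) > 1:
        return multi_token_engine(tokens)

    logits = None
    for token in tokens:
        logits = single_token_engine([token])  # one token at a time
    return logits
```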