Perplexity Eval for Text Generation Models #1073

Merged
Changes from 69 commits
78 commits
48ac0ac
initial commit
dbogunowicz Jun 5, 2023
cf7f2b9
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
832630a
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 6, 2023
9958c83
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 7, 2023
e6d2b03
limit to 150mb
dbogunowicz Jun 7, 2023
7f9935b
ready to review
dbogunowicz Jun 7, 2023
b1cf01b
initial commit
dbogunowicz Mar 2, 2023
0a3f48d
[Codegen][ORT][Static Seq Length] TextGenerationPipeline (#946)
dbogunowicz Mar 16, 2023
add4625
[CodeGen][Documentation] (#956)
dbogunowicz Mar 23, 2023
22d2746
reimplementation for generative pipelines
markurtz May 8, 2023
7f1651d
restore text generation from examples
dbogunowicz May 8, 2023
b85746d
[CodeGen] ONNX model loading to support >2Gb models / two engines (#991)
dbogunowicz May 8, 2023
aadc608
refactor sucessfull
dbogunowicz May 10, 2023
58bc2b0
Pipeline fully refactored, time to test engine support. Note: Sliding…
dbogunowicz May 11, 2023
d538444
First iteration with Sage
dbogunowicz May 11, 2023
e19676b
Apply suggestions from code review
dbogunowicz May 11, 2023
7908b74
ORT agrees with the Engine. But they both give not entirely correct r…
dbogunowicz May 11, 2023
4bc3472
dynamic ORT vs static DS
dbogunowicz May 12, 2023
c07f7ed
pipeline handles OPT multitoken pass
dbogunowicz May 16, 2023
fb77838
fixes to get static pipeline a little further along
May 16, 2023
2097463
adjust shapes and slicing to enable static autoregressive pass - ISSU…
May 17, 2023
5eb10a9
migrate from cache_length to positions input
May 18, 2023
9213f29
got if working for multitoken + single token scenario
dbogunowicz May 18, 2023
d9af004
cleanup the pipeline
dbogunowicz May 19, 2023
476f25d
further cleanup post merge
dbogunowicz May 19, 2023
fab44e4
Pipeline working for single-token inference only
dbogunowicz May 19, 2023
d454e2f
do not load the onnx model with external files twice
dbogunowicz May 19, 2023
1613e25
pipeline never redundantly saves the external data + more robust toke…
dbogunowicz May 19, 2023
b61055c
Stop saving tmp files, otherwise the engine looks for external files …
dbogunowicz May 19, 2023
6ee25fc
Left pad support
May 19, 2023
5d3004b
cleanup
dbogunowicz May 22, 2023
ace6fa5
cleanup2
dbogunowicz May 22, 2023
388586d
Add in pipeline timing
markurtz May 24, 2023
afd0139
add in force tokens logic
markurtz May 24, 2023
30eeda7
remove input validation for text generation pipelines
markurtz May 24, 2023
5882b56
remove multitoken support for now
markurtz May 24, 2023
4bbe33d
remove kv cache engine and other fixes
markurtz May 25, 2023
afa5746
nest input shape override
markurtz May 25, 2023
e2bb78c
comment out input shape override
markurtz May 25, 2023
2299009
add non batch override for ORT
markurtz May 25, 2023
2935b77
clean up generation pipeline
markurtz Jun 9, 2023
b89b156
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 11, 2023
dc3d61b
initial commit
dbogunowicz Jun 5, 2023
a294265
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
af97f2b
limit to 150mb
dbogunowicz Jun 7, 2023
c117788
ready to review
dbogunowicz Jun 7, 2023
4ad5f49
fix the erronous Makefile
dbogunowicz Jun 13, 2023
9e816bb
Merge branch 'feature/damian/do_not_save_to_tmp' of https://github.co…
dbogunowicz Jun 13, 2023
f97467f
perhaps fixed GHA
dbogunowicz Jun 13, 2023
6be8d87
take into consideration that GHA creates four files
dbogunowicz Jun 13, 2023
e2f088d
initial commit
dbogunowicz Jun 13, 2023
9fc6c64
Merge remote-tracking branch 'origin/feature/damian/do_not_save_to_tm…
dbogunowicz Jun 13, 2023
a610faf
tested with actual model
dbogunowicz Jun 13, 2023
347d1fb
remove val_inp argument
dbogunowicz Jun 13, 2023
e11027c
Update README.md
dbogunowicz Jun 13, 2023
a950910
Apply suggestions from code review
dbogunowicz Jun 13, 2023
c1d02dc
Update README.md
dbogunowicz Jun 13, 2023
711cdfb
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 13, 2023
e602662
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 14, 2023
2085c37
[BugFix] Update deepsparse dockerfile (#1069)
rahul-tuli Jun 14, 2023
2f7bc95
initial implementation
dbogunowicz Jun 15, 2023
e18fab7
working implementation for pipeline input
dbogunowicz Jun 16, 2023
0358d87
[Fix] Fix CLI benchmark errors (#1071)
dbogunowicz Jun 15, 2023
06b5246
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 16, 2023
2cab681
Merge branch 'feature/damian/codegen_pipeline_clean' into feature/dam…
dbogunowicz Jun 16, 2023
63b116b
Clean a typo in the pipeline code
dbogunowicz Jun 16, 2023
7001a6e
working implementation, time to cleanup
dbogunowicz Jun 29, 2023
79251e6
cleanup the old files
dbogunowicz Jun 29, 2023
9efbdb6
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jun 29, 2023
da5e93e
ready for review
dbogunowicz Jun 29, 2023
a680dac
ready for testing
dbogunowicz Jun 29, 2023
f83dcab
assert proper padding on pipeline init
dbogunowicz Jul 3, 2023
e659c33
now also supporting kv cache perplexity. time for cleanup
dbogunowicz Jul 3, 2023
cf74ad7
ready for review
dbogunowicz Jul 3, 2023
853f876
correctly print engine info
dbogunowicz Jul 3, 2023
e8da07e
work with left padding of the tokenizer
dbogunowicz Jul 3, 2023
58b12c8
quality
dbogunowicz Jul 3, 2023
eecd232
fix the multitoken inference
dbogunowicz Jul 5, 2023
5 changes: 1 addition & 4 deletions src/deepsparse/transformers/engines/nl_decoder_engine.py
@@ -154,10 +154,7 @@ def __call__(
         else:
             logits = out[0]

-        B, S, V = logits.shape  # batch, sequence, vocab
-        logits = logits[:, -1, :].reshape(B, 1, V)  # only take the last token
-
-        token = self.generate_token(logits=logits)
+        token = self.generate_token(logits=logits[:, -1, :])

         return token, logits
Contributor Author

We need the logits predicted from every prefix of the sequence, not only the logits for the last token: {}, {x1}, {x1, x2}, ..., {x1, x2, ..., x_n}
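
To illustrate why the last-token slice alone is not enough for perplexity, here is a minimal pure-Python sketch. It is not DeepSparse code: it assumes a toy vocabulary of three tokens and hand-written logits, and `log_softmax` and `sequence_nll` are hypothetical helpers introduced only for this illustration. The row of logits at position i scores the token at position i+1, so every row except the last contributes to the loss, whereas greedy generation reads only the final row.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_nll(logits, token_ids):
    """Mean negative log-likelihood of token_ids[1:] given the preceding tokens.

    logits[i] is assumed to hold the scores for the token at position i + 1.
    """
    if len(logits) != len(token_ids):
        raise ValueError("expected one row of logits per input token")
    losses = []
    for position in range(len(token_ids) - 1):
        log_probs = log_softmax(logits[position])
        losses.append(-log_probs[token_ids[position + 1]])
    return sum(losses) / len(losses)


# Toy input: 3 tokens, vocabulary of size 3 (values are made up).
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
token_ids = [0, 0, 1]

last_only = logits[-1]  # all that next-token generation needs
perplexity = math.exp(sequence_nll(logits, token_ids))  # needs every row
```

With only `last_only` available, none of the terms inside `sequence_nll` could be computed, which is why the engine has to return the full logits tensor while still slicing the last position for token generation.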


41 changes: 36 additions & 5 deletions src/deepsparse/transformers/eval_downstream.py
@@ -69,7 +69,7 @@
 from tqdm.auto import tqdm

 from deepsparse import Pipeline
-from deepsparse.transformers.metrics import PrecisionRecallF1
+from deepsparse.transformers.metrics import Perplexity, PrecisionRecallF1


 from datasets import load_dataset, load_metric  # isort: skip
@@ -78,6 +78,34 @@
 ORT_ENGINE = "onnxruntime"


+def perplexity_eval(args, batch_size=16, dataset_name="openai_humaneval"):
+    dataset = load_dataset(dataset_name)["test"]
+
+    text_generation = Pipeline.create(
+        task="text-generation",
+        model_path=args.model_path,
+        # TODO: make sure this also works for deepsparse engine
+        engine_type="onnxruntime",
+        num_cores=args.num_cores,
+        sequence_length=args.max_sequence_length,
+        prompt_processing_sequence_length=args.max_sequence_length,
+        max_generated_tokens=1,
+        tokenizer_padding_side="right",
+    )
+    perplexity_metrics = Perplexity(pipeline=text_generation, batch_size=batch_size)
+    # TODO: text_generation.engine is None
+    print(f"Engine info: {text_generation.engine}")
+    predictions = []
+    for idx, sample in _enumerate_progress(dataset, args.max_samples):
+        predictions.append(sample["prompt"] + sample["canonical_solution"])
+        if len(predictions) == batch_size:
+            perplexity_metrics.add_batch(predictions)
+            predictions = []
+        if idx == 32:
+            break
+    return perplexity_metrics
+
+
 def qa_eval(args, dataset_name="squad"):
     # load validation dataset and eval tool
     dataset = load_dataset(dataset_name)["validation"]
@@ -443,16 +471,20 @@ def _split_train_val(train_dataset, val_ratio, seed=42):
     "imdb": imdb_eval,
     "conll2003": conll2003_eval,
     "go_emotions": go_emotions_eval,
+    "openai_humaneval": perplexity_eval,
 }


 def parse_args():
     parser = argparse.ArgumentParser(
+        # TODO: Not BERT anymore
         description="Evaluate a BERT ONNX model on a downstream dataset"
     )
     parser.add_argument(
-        "model_path",
+        "-m",
+        "--model_path",
         type=str,
+        default="/home/ubuntu/damian/sparseml/deployment",
         help=(
             "The path to a directory containing model.onnx, config.json, and "
             "tokenizer.json files or SparseZoo stub to the model"
@@ -462,8 +494,7 @@ def parse_args():
         "-d",
         "--dataset",
         type=str,
-        choices=list(SUPPORTED_DATASETS.keys()),
-        required=True,
+        default="openai_humaneval",
     )
     parser.add_argument(
         "-v",
@@ -516,7 +547,7 @@ def parse_args():
         "--max-samples",
         help="the max number of samples to evaluate. Default is None or all samples",
         type=int,
-        default=None,
+        default=32,
     )

     parser.add_argument(
86 changes: 85 additions & 1 deletion src/deepsparse/transformers/metrics.py
@@ -17,18 +17,102 @@
 """


-from typing import Dict, Optional
+from typing import Any, Dict, List, Optional

 import numpy
+from tqdm import tqdm

+import torch
+from deepsparse import Pipeline
+from deepsparse.transformers.pipelines.text_generation import TextGenerationPipeline
 from sklearn.metrics import precision_recall_fscore_support


 __all__ = [
     "PrecisionRecallF1",
+    "Perplexity",
 ]


+class Perplexity:
+    def __init__(self, pipeline: Pipeline, batch_size: int = 16):
+        """
+        Given the pipeline, compute the perplexity of the model
+        on the given text input.
+
+        Code adapted from:
+        https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py  # noqa: E501
+
+        :param pipeline: The pipeline to use for text generation
+        :param batch_size: The batch size to split the input text into
+            non-overlapping batches
+        """
+        if not isinstance(pipeline, TextGenerationPipeline):
+            raise ValueError(
+                "Perplexity can only be computed for text generation pipelines"
+            )
+        self._pipeline = pipeline
+        self._batch_size = batch_size
+        self._sequence_length = pipeline.sequence_length
+        self._loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
+
+        self.perplexities = []
+
+    def add_batch(self, predictions: List[str]):
+        """
+        Run the model on the given input sequences and compute the perplexity.
+        The resulting perplexity is appended to the list of perplexities.
+
+        :param predictions: The predictions to compute perplexity on
+        """
+        # tokenize the input text
+        encodings = self._pipeline.tokenizer(
+            predictions,
+            return_attention_mask=True,
+            max_length=self._sequence_length,
+            truncation=True,
+            padding="max_length",
+        )
+
+        encoded_texts = encodings["input_ids"]
+        attention_masks = encodings["attention_mask"]
+
+        # split input_text into non-overlapping batches of `batch_size`
+        for start_index in tqdm(range(0, len(encoded_texts), self._batch_size)):
+            end_index = min(start_index + self._batch_size, len(encoded_texts))
+            encoded_batch = encoded_texts[start_index:end_index]
+            attention_mask = attention_masks[start_index:end_index]
+
+            out = self._pipeline(sequences=predictions, return_logits=True)
+            logits = out.logits
+
+            labels = encoded_batch
+
+            # shift logits and labels create the input and target for the loss function
+            shift_logits = logits[:, :-1, :]
+            shift_labels = numpy.stack(labels)[:, 1:]
+            shift_attention_mask_batch = numpy.stack(attention_mask)[:, 1:]
+
+            # compute perplexity for this batch
+            perplexity_batch = torch.exp(
+                (
+                    self._loss_fct(
+                        torch.tensor(shift_logits.transpose(0, 2, 1)),
+                        torch.tensor(shift_labels),
+                    )
+                    * torch.tensor(shift_attention_mask_batch)
+                ).sum(1)
+                / torch.tensor(shift_attention_mask_batch).sum(1)
+            )
+            self.perplexities.extend(perplexity_batch.numpy().tolist())
+
+    def compute(self) -> Dict[str, Any]:
+        return {
+            "mean_perplexity": numpy.mean(self.perplexities),
+            "perplexities": self.perplexities,
+        }
+
+
 class PrecisionRecallF1:
     def __init__(self, id_to_label: Optional[Dict[int, str]] = None):
         self._id_to_label = id_to_label
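
The masked reduction inside `Perplexity.add_batch` can be sketched without torch for a single sequence. This is an illustration under stated assumptions, not the PR's implementation: `masked_perplexity` is a hypothetical helper, and the per-token losses are made-up numbers standing in for the output of `CrossEntropyLoss(reduction="none")` after the shift. It shows that positions with a zero attention mask contribute nothing, so a padded sequence gets the same perplexity as its unpadded counterpart.

```python
import math


def masked_perplexity(shift_losses, shift_mask):
    """exp(mean loss over real tokens) for one sequence.

    shift_losses: per-token cross-entropy, already shifted by one position
    shift_mask:   attention mask shifted the same way (1 = real, 0 = padding)
    """
    if len(shift_losses) != len(shift_mask):
        raise ValueError("losses and mask must have the same length")
    n_real = sum(shift_mask)
    if n_real == 0:
        raise ValueError("sequence has no scored tokens")
    total = sum(loss * m for loss, m in zip(shift_losses, shift_mask))
    return math.exp(total / n_real)


# Right-padded sequence: the two trailing losses belong to padding positions.
padded = masked_perplexity([0.5, 1.5, 9.0, 9.0], [1, 1, 0, 0])

# The same sequence without padding.
unpadded = masked_perplexity([0.5, 1.5], [1, 1])
```

Dividing by the number of real tokens rather than by the padded length is what keeps the metric independent of `sequence_length`.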
12 changes: 9 additions & 3 deletions src/deepsparse/transformers/pipelines/text_generation.py
@@ -89,6 +89,8 @@ class TextGenerationPipeline(TransformersPipeline):
         of tokens supplied even if the stop token is reached.
     :param use_deepsparse_cache: if True, the pipeline will use the deepsparse kv cache
         for caching the model outputs.
+    :param tokenizer_padding_side: the side to pad the input sequence to.
Contributor

As discussed offline: running right-padded for eval will likely not work for the engine (single-token prefill), because internally the KV cache is built assuming left padding, and entries are popped from the left side of the cache as it is built up. In the right-padded scenario, I believe this will delete the actual non-padded values from the cache too early.
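
For readers unfamiliar with the two settings the comment above contrasts, here is a small pure-Python sketch of what left and right padding produce. `pad` and `last_real_index` are hypothetical helpers written for this illustration; they are not the tokenizer's API, and the sketch does not model the engine's KV cache. It only shows where the real tokens end up: with left padding the last real token always sits at index -1, while with right padding its index depends on the sequence length.

```python
def pad(token_ids, length, pad_id, side="left"):
    """Pad (or truncate) token_ids to `length`; return (ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    token_ids = list(token_ids[:length])  # truncate overly long inputs
    n_pad = length - len(token_ids)
    padding, pad_mask = [pad_id] * n_pad, [0] * n_pad
    real_mask = [1] * len(token_ids)
    if side == "left":
        return padding + token_ids, pad_mask + real_mask
    return token_ids + padding, real_mask + pad_mask


def last_real_index(attention_mask):
    """Index of the last non-padding position."""
    return max(i for i, m in enumerate(attention_mask) if m == 1)


# Made-up token ids; 0 is used as the padding id here.
left_ids, left_mask = pad([11, 12, 13], length=5, pad_id=0, side="left")
right_ids, right_mask = pad([11, 12, 13], length=5, pad_id=0, side="right")
```

Any component that assumes padding sits on the left, as the comment says the engine's cache handling does, will treat the leading real tokens of a right-padded input as if they were padding.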

+        Either "left" or "right". Defaults to "left".
     :param kwargs: kwargs to pass to the TransformersPipeline
     """

@@ -101,6 +103,7 @@ def __init__(
         prompt_processing_sequence_length: int = 128,
         force_max_tokens: bool = False,
         use_deepsparse_cache: bool = False,
+        tokenizer_padding_side: str = "left",
         **kwargs,
     ):
         if use_deepsparse_cache:
@@ -126,8 +129,7 @@ def __init__(
         self.prompt_processing_sequence_length = prompt_processing_sequence_length
         self.force_max_tokens = force_max_tokens

-        # override tokenizer to pad to left
-        self.tokenizer.padding_side = "left"
+        self.tokenizer.padding_side = tokenizer_padding_side

         self.engine = None
         self.multitoken_engine = NLDecoderEngine(
@@ -207,6 +209,8 @@ def process_inputs(self, inputs: TextGenerationInput) -> List[numpy.ndarray]:
             return_tensors="np",
             max_length=self.sequence_length,
             padding="max_length",
+            # TODO: Truncating by default may be a problem
+            truncation=True,
         )

         attention_mask = input_tokens["attention_mask"]
@@ -240,7 +244,9 @@ def process_engine_outputs(
         """
         generated_tokens, generated_logits = engine_outputs
         sequences = self.tokenizer.batch_decode(
-            *generated_tokens, skip_special_tokens=True
+            # TODO: hack for now, make it general
+            *generated_tokens[0],
+            skip_special_tokens=True,
         )
         logits = generated_logits if kwargs.get("return_logits") else None
