Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serving][python] Support non 200 HTTP response codes for non-streami… #2173

Merged
merged 1 commit into from
Jul 16, 2024

Conversation

siddvenk
Copy link
Contributor

@siddvenk siddvenk commented Jul 13, 2024

…ng rolling-batch

  1. Change non-streaming output formatters to only generate output on final token (valid and error cases)
  2. Update InferenceRequestHandler to only publish HTTP Response when data is actually received
  3. Update RollingBatch to properly set status code in non-streaming use-cases

Note:

  • The ask for supporting non 200 HTTP responses has been asked for be a few users.

TODO:

  • I need to update unit tests, but i have done extensive local testing and am comfortable with the core changes raised here. I'll change from WIP to Open once i complete the unit tests.

This change is motivated by two things:

  • Make output formatters for non streaming use-cases easier to deal with, and more readable
  • Provide non-200 HTTP status code in the non-streaming error cases.

The output formatters for non-streaming use-cases are currently pretty hard to reason about. Beyond just making it easier to read and work with, this also fixes an issue:

  • If there is an error during generation after we have generated a few tokens, the current behavior results in a response that is not valid json. This change ensure that the output formatters we provide always result in valid json content in mid-generation error scenarios

The second part of this change is to provide non-200 status codes during non-streaming use-cases when an error occurs during inference. This is only possible with the first change to the output formatters. There are two small changes in InferenceRequestHandler and RollingBatch to enable this.

@siddvenk siddvenk force-pushed the rb-non-streaming branch 4 times, most recently from 0ef29b9 to a05c246 Compare July 15, 2024 19:46
@siddvenk siddvenk marked this pull request as ready for review July 15, 2024 22:22
@siddvenk siddvenk requested review from zachgk, frankfliu and a team as code owners July 15, 2024 22:22
@siddvenk siddvenk marked this pull request as draft July 15, 2024 23:50
@siddvenk siddvenk force-pushed the rb-non-streaming branch 4 times, most recently from 4fdf2de to abb0cf8 Compare July 16, 2024 03:12
@siddvenk siddvenk marked this pull request as ready for review July 16, 2024 03:16
@@ -98,7 +98,7 @@ def rolling_batch_inference(parsed_input, inputs: Input, outputs: Output,
outputs.add(Output.binary_encode(err), key="data", batch_index=i)
outputs.add_property(f"batch_{i}_Content-Type", "application/json")
else:
content_type = result[idx].pop("content_type")
content_type = result[idx].get("content_type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious if .pop() causing any issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to use get here. Maybe in the future someone would access the content in the later time. Python GC would collect it eventually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pop does not cause issues currently, but it's incorrectly used here with the subsequent check to see if content-type is None.

Either we use pop and get rid of that check, or use get here.

@@ -628,7 +607,7 @@ def test_chat_json(self):
},
'finish_reason': 'length'
}]
assert final_json['usage'] == {
assert output['usage'] == {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A side topic, noticed we use assert in production code quite a bit. I'm not a fan of assert, we should rethink about usage of assert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, this is the test code but we should not be using assert in the production code.

I think keeping assert in the test code is fine - separately i'm exploring a migration to pytest instead of unittest to get some better test outputs and codecoverage. pytest only supports assert

req.set_next_token(Token(4558, " world", -0.567854), True,
'length')
print(req.get_next_token(), end='')
assert req.get_next_token() == ' world"}'
req.reset_next_token()
assert req.get_next_token() == json.dumps(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.assertEquals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me take this up as a separate change. We don't have consistent usage of assert/self.assert across our tests, so i can fix on that across our tests

@@ -306,24 +304,6 @@ void addResponse(byte[] json, Map<String, String> properties) {
output.getProperties().putAll(properties);
}
++count;
if (json[0] == '{') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may potentially break some legacy custom handler. Maybe mark as deprecate and raise Deprecation warning for this code piece execution for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if this condition is still valid, but i'll add it back with a deprecation warning.

We removed the huggingface.parse_input and replaced it with the input formatter @sindhuvahinis implemented.

I think this is only valid if user does json encoding on the outputs from python side that are received on java side

…g-batch

1. Change non-streaming output formatters to only generate output on final token (valid and error cases)
3. Update RollingBatch to set status code in non-streaming use-cases
@siddvenk siddvenk merged commit a8ac267 into deepjavalibrary:master Jul 16, 2024
9 checks passed
@siddvenk siddvenk deleted the rb-non-streaming branch July 16, 2024 17:09
davidthomas426 added a commit to davidthomas426/djl-serving that referenced this pull request Oct 18, 2024
…pport for non-200 HTTP codes for non-streaming rolling batch requests
siddvenk added a commit that referenced this pull request Oct 19, 2024
#2466)

Co-authored-by: Siddharth Venkatesan <siddhave@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants