[Bugfix][Frontend] Fix Issues Under High Load With zeromq #7394
Label: Frontend
Merged: robertgshaw2-redhat merged 88 commits into vllm-project:main from neuralmagic:fix-zmq-max-sockets on Aug 21, 2024
Changes from 72 commits
Commits (88):
b2e29a5 added proxy to limit use of uniz sockets (robertgshaw2-redhat)
8d31115 Merge branch 'main' into fix-zmq-max-sockets (robertgshaw2-redhat)
6d2b3df comment (robertgshaw2-redhat)
c73e943 use random inproc path (robertgshaw2-redhat)
f1768fb format (robertgshaw2-redhat)
601a461 foamt (robertgshaw2-redhat)
1a47d94 format (robertgshaw2-redhat)
eeecb09 Update vllm/entrypoints/openai/rpc/client.py (robertgshaw2-redhat)
2770e40 cleaning (robertgshaw2-redhat)
5a85618 Merge branch 'main' into fix-zmq-max-sockets (robertgshaw2-redhat)
938db1d Merge branch 'fix-zmq-max-sockets' of https://github.com/neuralmagic/… (robertgshaw2-redhat)
ea2f03e remove logging (robertgshaw2-redhat)
5cebc65 add info message re: concurrency (robertgshaw2-redhat)
2c12436 update comment (robertgshaw2-redhat)
9afd6ba update (robertgshaw2-redhat)
c262088 format (robertgshaw2-redhat)
3e580d5 reorder (robertgshaw2-redhat)
d9e10e0 reverT (robertgshaw2-redhat)
4e3a63a fix (robertgshaw2-redhat)
e54bf8a fix (robertgshaw2-redhat)
6544f3a fix abort logic (robertgshaw2-redhat)
81f4da8 reduce LOC change (robertgshaw2-redhat)
b3374bc cleanup (robertgshaw2-redhat)
dd1817a cleanup (robertgshaw2-redhat)
5b56365 format (robertgshaw2-redhat)
05ff816 fix client (robertgshaw2-redhat)
e551d30 revert unneccessary change (robertgshaw2-redhat)
3d7f65f revert startup probe changes to separate PR (robertgshaw2-redhat)
e7e6f1e stash (robertgshaw2-redhat)
eaaebcc Merge branch 'main' into fix-zmq-max-sockets (robertgshaw2-redhat)
21b5239 stash draining (robertgshaw2-redhat)
7e15b00 update (robertgshaw2-redhat)
74c4166 stash (robertgshaw2-redhat)
450e949 convert RPCServer to use DEALER (robertgshaw2-redhat)
8348f1f stash (robertgshaw2-redhat)
545956e fix (robertgshaw2-redhat)
7a34611 cleaning (robertgshaw2-redhat)
50abb94 stash (robertgshaw2-redhat)
1723687 remove awk (robertgshaw2-redhat)
3dfc9ef nits (robertgshaw2-redhat)
8d40f2d format (robertgshaw2-redhat)
3397460 format (robertgshaw2-redhat)
ef132dc nit (robertgshaw2-redhat)
10ef204 change (robertgshaw2-redhat)
b67718f clean (robertgshaw2-redhat)
c3c1dbe Update vllm/entrypoints/openai/rpc/server.py (robertgshaw2-redhat)
ee6efcf format (robertgshaw2-redhat)
3fdc2fe cleanup abort logic (robertgshaw2-redhat)
4cacb56 nit (robertgshaw2-redhat)
724eb31 added load test (robertgshaw2-redhat)
4d5e6b7 update load test (robertgshaw2-redhat)
b9e4168 updated (robertgshaw2-redhat)
8f9bc23 format (robertgshaw2-redhat)
9a2be3f updated (robertgshaw2-redhat)
dee38f0 revert suurious change (robertgshaw2-redhat)
e78f443 convert to even smaller model (robertgshaw2-redhat)
cc2d7db 20k requests (robertgshaw2-redhat)
b40e269 convert to 10k requests (robertgshaw2-redhat)
03eed9c clean up closing logic (robertgshaw2-redhat)
f697226 use constant (robertgshaw2-redhat)
fd642ab fix bad cleanup (robertgshaw2-redhat)
762c2ed remove useless argument (robertgshaw2-redhat)
c805ed2 up to 20k requests (robertgshaw2-redhat)
2e1652e revert to 10k requests (robertgshaw2-redhat)
3e1ede4 revert suprious argument (robertgshaw2-redhat)
b3bf7ef revert to 20k (robertgshaw2-redhat)
708bd34 format (robertgshaw2-redhat)
10a88ec [BugFix] Raise all exception variations in async generator (njhill)
db8aebc Fix possible premature generator completion; add tests (njhill)
b16c64b format (robertgshaw2-redhat)
a9ecaa9 added test accuracy (robertgshaw2-redhat)
6f8d5e8 format (robertgshaw2-redhat)
bab177f updated test pipeline (robertgshaw2-redhat)
7b58281 fix lm eval (robertgshaw2-redhat)
adf45d1 cleanup (robertgshaw2-redhat)
9e827b0 updated (robertgshaw2-redhat)
47dca36 Merge branch 'main' into fix-zmq-max-sockets (robertgshaw2-redhat)
f84c341 added sleep time (robertgshaw2-redhat)
0ce78f8 actually sleep (robertgshaw2-redhat)
8054348 formatting (robertgshaw2-redhat)
5ddbdab format (robertgshaw2-redhat)
1ebbe9e mypy (robertgshaw2-redhat)
53d639b mypy (robertgshaw2-redhat)
a36b381 format (robertgshaw2-redhat)
415ee39 remove test load (robertgshaw2-redhat)
26440e6 stash (robertgshaw2-redhat)
2442a9d Merge branch 'fix-zmq-max-sockets' of https://github.com/neuralmagic/… (robertgshaw2-redhat)
b72f84f Merge branch 'fix-raise-cancelled' into fix-zmq-max-sockets (robertgshaw2-redhat)
New file added in this PR: an lm_eval accuracy test for the OpenAI-compatible server.
@@ -0,0 +1,55 @@
"""
This file tests the accuracy of the vLLM server via LMEval.
It uses local-completions, which interacts with vLLM
through the OAI API with N concurrent connections.
This simulates real-world usage of the API and makes
sure that the zmq frontend mp RPC message passing and
AsyncLLMEngine are working correctly.
"""

import lm_eval
import pytest

from ...utils import RemoteOpenAIServer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"
NUM_CONCURRENT = 500
TASK = "gsm8k"
FILTER = "exact_match,strict-match"
RTOL = 0.03
EXPECTED_VALUE = 0.58


@pytest.fixture(scope="module")
def server():
    args = [
        "--max-model-len", "4096", "--enable-chunked-prefill",
        "--disable-log-requests", "--enforce-eager"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest.fixture(scope="module")
def server_data(server):
    return {
        "url": f"{server.url_for('v1')}/completions",
    }


def test_lm_eval_accuracy(server_data):
    model_args = (f"model={MODEL_NAME},"
                  f"base_url={server_data['url']},"
                  f"num_concurrent={NUM_CONCURRENT},tokenized_requests=False")

    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=model_args,
        tasks=TASK,
    )

    measured_value = results["results"][TASK][FILTER]
    assert (measured_value - RTOL < EXPECTED_VALUE
            and measured_value + RTOL > EXPECTED_VALUE
            ), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"
New file added in this PR: a high-load test for the OpenAI-compatible server.
Review comment on this file: cc @simon-mo -- this test takes ~3 minutes on H100. Will likely take >10 min on L4 ... are you okay with this?

@@ -0,0 +1,105 @@
"""
This file tests significant load on the vLLM server.

Inside vLLM, we use a zeromq-based RPC protocol
to enable multiprocessing w/ the API server and
the AsyncLLMEngine to avoid GIL conflicts.

This test confirms that even at high load with many
concurrent requests, zmq does not drop any messages.
"""

import asyncio
import json

import aiohttp
import pytest

from ...utils import RemoteOpenAIServer

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

MODEL_NAME = "Qwen/Qwen2-0.5B-Instruct"
NUM_REQUESTS = 20000
MAX_TOKENS = 50
MESSAGES = [{
    "role": "system",
    "content": "you are a helpful assistant"
}, {
    "role": "user",
    "content": "The meaning of life is"
}]


@pytest.fixture(scope="module")
def server():
    args = [
        "--max-model-len", "4096", "--enable-chunked-prefill",
        "--disable-log-requests", "--enforce-eager"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest.fixture(scope="module")
def server_data(server):
    return {
        "url": f"{server.url_for('v1')}/chat/completions",
        "api_key": server.DUMMY_API_KEY
    }


# Cannot use the async OpenAI client due to limitations in the maximum
# number of concurrent requests that can be sent to the server
# from the client.
async def async_openai_chat(model_name, url, api_key):
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        payload = {
            "model": model_name,
            "messages": MESSAGES,
            "temperature": 0.0,
            "max_tokens": MAX_TOKENS,
            "stream": False,
        }
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }

        async with session.post(url=url, json=payload,
                                headers=headers) as response:
            assert response.status == 200
            # data = json.loads(response.text)
            data = json.loads(await response.text())
            completion_tokens = data["usage"]["completion_tokens"]
            text = data["choices"][0]["message"]

        return (completion_tokens, text)


async def get_request(model_name, url, api_key):
    for _ in range(NUM_REQUESTS):
        yield async_openai_chat(model_name, url, api_key)


@pytest.mark.asyncio
@pytest.mark.parametrize(
    "model_name",
    [MODEL_NAME],
)
async def test_load(server_data, model_name):
    # Make requests to the server.
    tasks = []
    async for request in get_request(model_name, server_data["url"],
                                     server_data["api_key"]):
        tasks.append(asyncio.create_task(request))
    outputs = await asyncio.gather(*tasks)

    # Check that each client generated exactly 50 tokens.
    # If this is true, then we are not seeing any message dropping in zeromq.
    for idx, (completion_tokens, text) in enumerate(outputs):
        assert completion_tokens == MAX_TOKENS, (
            f"Request {idx}: Expected {MAX_TOKENS} completion tokens but "
            f"only {completion_tokens} were generated. "
            f"zeromq multiprocessing frontend is likely dropping messages. "
            f"Full text:\n\n\n {text}")
@@ -131,6 +131,9 @@ async def build_async_engine_client(args) -> AsyncIterator[AsyncEngineClient]:
         logger.info("Multiprocessing frontend to use %s for RPC Path.",
                     rpc_path)
 
+        # Build RPCClient, which conforms to AsyncEngineClient Protocol.
+        async_engine_client = AsyncEngineRPCClient(rpc_path)
+
         # Start RPCServer in separate process (holds the AsyncLLMEngine).
         context = multiprocessing.get_context("spawn")
         # the current process might have CUDA context,
@@ -141,8 +144,6 @@ async def build_async_engine_client(args) -> AsyncIterator[AsyncEngineClient]:
         rpc_server_process.start()
         logger.info("Started engine process with PID %d",
                     rpc_server_process.pid)
-        # Build RPCClient, which conforms to AsyncEngineClient Protocol.
-        async_engine_client = AsyncEngineRPCClient(rpc_path)
 
         try:
             while True:

Review comment on the added lines: "moved first, since we ..." (comment truncated in the capture).
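The "spawn" context in the hunk above matters because the API server process may already hold a CUDA context, which must not be inherited by the engine process. The sketch below illustrates the standard-library pattern only; the entrypoint name and path are placeholders, not vLLM's.

import multiprocessing


def run_rpc_server(rpc_path: str) -> None:
    # Placeholder entrypoint; in vLLM this is where the AsyncLLMEngine-backed
    # RPC server runs in the child process.
    print(f"engine process would serve on {rpc_path}")


if __name__ == "__main__":
    # "spawn" starts a fresh interpreter, so the child does not inherit the
    # parent's CUDA context (unlike "fork").
    context = multiprocessing.get_context("spawn")
    proc = context.Process(target=run_rpc_server,
                           args=("ipc:///tmp/example.sock",))
    proc.start()
    proc.join()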
Review comment: need to install from source, since the local-completions API with support for concurrent requests is not yet in a release of lm_eval.
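For reference (not stated in the PR itself), installing the harness from source typically means pointing pip at the upstream repository, e.g. pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git, where the URL is assumed to be the upstream lm_eval project rather than taken from this PR.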