[RFC]: Batch API for inference job #182
Labels: kind/feature, priority/important-soon
### Summary
This proposal exposes a batch API to users so that they can submit a batch job and retrieve the job's status and results at any time after submission. However, current inference engines such as vLLM do not support such a batch feature. This design fills the gap between them.
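To make the intended user-facing flow concrete, here is a minimal in-memory sketch of the submit-then-retrieve lifecycle. All class, method, and field names here are hypothetical illustrations, not the proposed API surface.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class BatchJob:
    """Illustrative record for one submitted batch job."""
    job_id: str
    requests: list
    status: str = "validating"   # initial state on submission
    results: list = field(default_factory=list)

class BatchClient:
    """Hypothetical client: submit a batch, then poll it anytime later."""
    def __init__(self):
        self._jobs = {}

    def create_batch(self, requests):
        job = BatchJob(job_id=f"batch_{uuid.uuid4().hex[:8]}", requests=requests)
        self._jobs[job.job_id] = job
        return job.job_id

    def retrieve(self, job_id):
        # Status/results remain retrievable at any time after submission.
        return self._jobs[job_id]

client = BatchClient()
jid = client.create_batch([{"prompt": "hello"}])
```

The real system would back this with persistent storage rather than a dict, which is exactly what the components below provide.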
### Motivation
To support a batch API for users running batch inference jobs, our inference system needs to handle batch job input and output and perform time-based scheduling, neither of which is within the scope of the inference engine. The motivation divides into two parts: one covers fundamental capabilities, the other covers optimizations for better performance.

**Part 1: Fundamental capabilities.** This part lists the essential components required to make E2E batch inference work.

**Part 2: Optimization.** Once all basic capabilities are ready, this part focuses on performance improvement.
### Proposed Change
For the first part, this proposal builds several fundamental components to support the OpenAI batch API.

**1. Input/output storage**
(a) Store each job's input and output. This acts as persistent storage serving users' retrieval requests.
(b) Provide read/write interfaces for request input, request output, and job metadata.
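The storage component above could be sketched as follows. SQLite stands in here for whichever backing store is actually chosen, and all table, column, and method names are assumptions for illustration.

```python
import json
import sqlite3

class JobStore:
    """Sketch of component (1): persistent job input/output/metadata storage."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id TEXT PRIMARY KEY, input TEXT, output TEXT, metadata TEXT)"
        )

    def write_input(self, job_id, requests):
        # Persist the job's input so it survives restarts.
        self.db.execute(
            "INSERT OR REPLACE INTO jobs (job_id, input) VALUES (?, ?)",
            (job_id, json.dumps(requests)),
        )

    def write_output(self, job_id, responses):
        self.db.execute(
            "UPDATE jobs SET output = ? WHERE job_id = ?",
            (json.dumps(responses), job_id),
        )

    def read(self, job_id):
        # Serves users' retrieval requests at any time after submission.
        row = self.db.execute(
            "SELECT input, output FROM jobs WHERE job_id = ?", (job_id,)
        ).fetchone()
        return {
            "input": json.loads(row[0]),
            "output": json.loads(row[1]) if row[1] else None,
        }

store = JobStore()
store.write_input("batch_1", [{"prompt": "hi"}])
store.write_output("batch_1", [{"text": "hello"}])
```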
**2. Job metadata management**
(a) Handle job state transitions. This should clearly outline the transition diagram among the different states.
(b) Manage job status, including job creation time, current status, scheduled resources, and so on.
(c) Persist state to storage as checkpoints. With this, users can retrieve job status consistently.
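A possible shape for the state transitions in (a) is sketched below. The state names follow the OpenAI Batch API's statuses, but the exact set of states and allowed edges here are assumptions; pinning them down is precisely what the transition diagram this RFC calls for would do.

```python
# Assumed transition table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    "validating":  {"in_progress", "failed"},
    "in_progress": {"finalizing", "cancelled", "expired"},
    "finalizing":  {"completed", "failed"},
    # terminal states (completed, failed, cancelled, expired) have no edges
}

def transition(current, new):
    """Move a job to a new state, rejecting edges outside the diagram."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new

state = "validating"
state = transition(state, "in_progress")
state = transition(state, "finalizing")
state = transition(state, "completed")
```

Checkpointing (c) would then persist the current state alongside the job metadata so retrieval stays consistent across restarts.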
**3. Job scheduler**
(a) Maintain a time-based sliding window of jobs. Based on job creation time, the window slides every minute.
(b) Perform FIFO job scheduling and issue request queries to the inference engine. This prepares all necessary input for the inference engine.
(c) Sync job status. When a response is received from the inference engine, propagate it to the job window and metadata management.
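The scheduler's sliding window and FIFO dispatch might look like the sketch below. The one-minute window, the interpretation of "slide" (advance the window end and dispatch jobs created inside it), and the dispatch callback are all assumptions drawn from the description above.

```python
import time
from collections import deque

WINDOW_SECONDS = 60  # assumed one-minute slide interval

class Scheduler:
    """Sketch of component (3): sliding window + FIFO dispatch to the engine."""
    def __init__(self, dispatch, window_start=0.0):
        self.queue = deque()        # ordered by creation time, hence FIFO
        self.dispatch = dispatch    # callable sending prepared requests to the engine
        self.window_end = window_start

    def submit(self, created_at, job_id, requests):
        self.queue.append((created_at, job_id, requests))

    def slide(self):
        # Advance the window by one minute and dispatch, FIFO, every job
        # whose creation time now falls inside it.
        self.window_end += WINDOW_SECONDS
        dispatched = []
        while self.queue and self.queue[0][0] <= self.window_end:
            _, job_id, requests = self.queue.popleft()
            self.dispatch(job_id, requests)
            dispatched.append(job_id)
        return dispatched

sent = []
sched = Scheduler(lambda jid, reqs: sent.append(jid))
sched.submit(10.0, "job-a", [{"prompt": "x"}])
sched.submit(70.0, "job-b", [{"prompt": "y"}])
first = sched.slide()   # window now covers up to t=60: dispatches job-a
second = sched.slide()  # window now covers up to t=120: dispatches job-b
```

Step (c) would be the inverse path: the engine's responses flow back through a callback that updates the window and the metadata manager.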
The changes belonging to the second part of the motivation are deferred for now. Once we have a clear outline of the fundamentals, we will have a better understanding of the optimization tasks.
### Alternatives Considered
No response