[RFC]: Batch API for inference job #182

Open
Tracked by #698
xinchen384 opened this issue Sep 16, 2024 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Comments

@xinchen384
Contributor

Summary

This RFC aims to expose a batch API to users so that they can submit a batch job and retrieve its status and results at any time after submission. Current inference engines such as vLLM do not natively support such a batch feature; this design fills that gap.

Motivation

To support a batch API for users running batch inference jobs, our inference system needs to handle batch jobs' input and output and perform time-based scheduling, both of which are outside the scope of the inference engine. Below, the motivation is divided into two parts: one covers fundamental capabilities, and the other covers optimizations for better performance.

This part lists the essential components needed to make E2E batch inference work. It is motivated by the need to:

  • Provide storage for users' input and output.
  • Manage batch jobs' due times and other status information.
  • Schedule jobs FIFO, with a time-based sliding window.
  • Guarantee at-least-once execution.
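For context, the user-facing flow these capabilities serve follows the OpenAI Batch API shape: users upload a JSONL file where each line is one independent request, then later retrieve results matched by `custom_id`. A minimal sketch of that input format (the field names follow OpenAI's published Batch API; the model name is illustrative):

```python
import json

# Each line of a batch input file is one independent request.
# "custom_id" lets the user match outputs back to inputs,
# "url" names the target endpoint, and "body" is the usual request payload.
batch_input_lines = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "my-model",  # illustrative model name
            "messages": [{"role": "user", "content": "Hello"}],
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "my-model",
            "messages": [{"role": "user", "content": "What is batching?"}],
        },
    },
]

# Serialize to JSONL, the on-disk form the persistent storage would hold.
jsonl = "\n".join(json.dumps(line) for line in batch_input_lines)
```

The storage layer described below would persist exactly this kind of file, plus a matching JSONL output file keyed by `custom_id`.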

With the basic capabilities in place, this part focuses on performance improvements.

  • Fault tolerance and consistency.
  • Scalability to support a large number of batch jobs.
  • Job scheduling: to maximize the number of jobs meeting their due times, jobs can be scheduled more efficiently than plain FIFO. This is transparent to the inference engine.
  • More fine-grained request scheduling, compatible with inference-engine features such as pipeline parallelism.

Proposed Change

For the first part, this proposal builds several fundamental components to support the OpenAI Batch API.

  1. Persistent storage
    (a). Store each job's input and output. This serves as persistent storage for users' retrieval requests.
    (b). Interfaces for reading/writing request input and output, and job metadata.
  2. Job metadata management
    (a). Handle jobs' state transitions. This should clearly define the transition diagram among the different states.
    (b). Manage jobs' status, including job creation time, current state, scheduled resources, and so on.
    (c). Persist to storage via checkpoints. With this, users can retrieve jobs' status consistently.
  3. Job scheduling
    (a). Maintain a time-based sliding window of jobs. Based on job creation time, the window slides every minute.
    (b). Perform FIFO job scheduling and issue requests to the inference engine, preparing all necessary input for it.
    (c). Sync job status. When a response is received from the inference engine, propagate it to the job window and to metadata management.
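The three components above can be tied together in a small sketch: an explicit state-transition table (item 2a), a job record carrying metadata (item 2b), and a FIFO scheduler that only admits jobs whose creation time falls inside a sliding window (item 3a/3b). All names and states here are hypothetical illustrations, not the actual implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto

class JobState(Enum):
    VALIDATING = auto()
    IN_PROGRESS = auto()
    COMPLETED = auto()
    FAILED = auto()
    EXPIRED = auto()

# Legal state transitions; anything outside this table is rejected.
TRANSITIONS = {
    JobState.VALIDATING: {JobState.IN_PROGRESS, JobState.FAILED},
    JobState.IN_PROGRESS: {JobState.COMPLETED, JobState.FAILED, JobState.EXPIRED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.EXPIRED: set(),
}

@dataclass
class BatchJob:
    job_id: str
    created_at: float  # submission timestamp (seconds)
    due_at: float      # completion-window deadline

    def __post_init__(self):
        self.state = JobState.VALIDATING

    def transition(self, new_state: JobState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

class SlidingWindowScheduler:
    """FIFO scheduler admitting jobs whose creation time lies inside a
    sliding window of `window_seconds` ending at the current time."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.queue: list[BatchJob] = []  # FIFO by submission order

    def submit(self, job: BatchJob) -> None:
        self.queue.append(job)

    def schedulable(self, now: float) -> list[BatchJob]:
        lo = now - self.window_seconds
        return [
            j for j in self.queue
            if lo <= j.created_at <= now and j.state == JobState.VALIDATING
        ]
```

In this sketch, sliding the window (called every minute in the proposal) simply means re-evaluating `schedulable(now)` with a later `now`; responses from the inference engine would drive `transition` calls, which is the sync step in item 3c.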

The changes listed in the second part of the motivation are deferred for now. Once the fundamentals are clearly outlined, we will have a better understanding of the optimization tasks.

Alternatives Considered

No response

@xieus
Collaborator

xieus commented Sep 19, 2024

Cool! Great RFC overall.

@Jeffwan Jeffwan added this to the v0.2.0 milestone Sep 25, 2024
@Jeffwan Jeffwan added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 25, 2024
@Jeffwan Jeffwan modified the milestones: v0.2.0, v0.3.0 Nov 19, 2024
@Jeffwan
Collaborator

Jeffwan commented Nov 19, 2024

Remove from v0.2.0 release and move to v0.3.0.
