Chat template: return vectorized output in processors #34275
Conversation
This looks good to me! Just to clarify, the idea is that if you pass a chat to `apply_chat_template` and some of the `content` fields contain images or videos, and `tokenize=True`, then images and videos are loaded and processed, so that the output is ready to pass to the model?
Yep, exactly!
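For readers following along, here is a minimal sketch of that end-to-end flow. The model checkpoint, message schema, and exact keyword arguments are assumptions for illustration, not taken verbatim from this PR:

```python
from transformers import AutoProcessor

# Any multimodal processor with a chat template would do; this checkpoint is just an example.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},  # placeholder URL
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# With tokenize=True, the image referenced in `content` is loaded and preprocessed,
# so the returned dict of tensors can be passed straight to the model.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```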
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for working on this!
The main question that is not clear to me is: what is the backend selection strategy? Should we pass it explicitly? I see `load_video` is used without passing the `backend` and `num_frames` arguments.
Welcome back @qubvel! Okay, I'll add more type hints and better docs. The backend should be selectable by the user, but we default to the one that works in all cases and has no weird CUDA-related failures. We should probably document this somewhere, but I haven't found a good place for it yet.
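As an aside, a small sketch of what explicit backend selection could look like from the caller's side; the import path, argument names (`backend`, `num_frames`), and return type are assumptions about the helper rather than confirmed signatures:

```python
from transformers.image_utils import load_video  # assumed import location

# Ask for 32 uniformly sampled frames via the decord backend;
# leaving `backend` unset would fall back to the default (opencv).
frames = load_video("my_clip.mp4", num_frames=32, backend="decord")
```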
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Thanks, some nits!
```python
if video.startswith("https://www.youtube.com") or video.startswith("http://www.youtube.com"):
    if not is_yt_dlp_available():
        raise ImportError("To load a video from YouTube url you have to install `yt_dlp` first.")
    buffer = BytesIO()
    with redirect_stdout(buffer), YoutubeDL() as f:
        f.download([video])
    bytes_obj = buffer.getvalue()
    file_obj = BytesIO(bytes_obj)
elif video.startswith("http://") or video.startswith("https://"):
    file_obj = BytesIO(requests.get(video).content)
```
Some additional kwargs might be required here, e.g. timeout, but probably fine for now
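For instance, a minimal sketch of threading a timeout through; the value and the exact plumbing are illustrative assumptions:

```python
from io import BytesIO

import requests

video = "https://example.com/sample.mp4"  # placeholder URL

# A bounded timeout prevents hanging indefinitely on an unresponsive host;
# other requests kwargs (headers, proxies, ...) could be exposed the same way.
file_obj = BytesIO(requests.get(video, timeout=10.0).content)
```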
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Huh, I don't know why it requested review from so many people, feel free to unsubscribe, sorry!
For me this is super good in terms of API!
I am mostly wondering whether this poses security threats, since we now open links for the user, whereas before the user had to open the link explicitly in their own code.
Hmm, good point about the security. We actually already have a few processors that open links for you, e.g. Idefics and Pixtral. I haven't seen anyone flag it as a security breach, so maybe it's not a big deal?
Thanks!
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Let's remove it or put it in the benchmark file, but that's probably overkill!
Also this file seems unrelated
And this one as well
```python
if is_decord_available():
    from decord import VideoReader, cpu

if is_av_available():
    import av

if is_cv2_available():
    import cv2

if is_yt_dlp_available():
    from yt_dlp import YoutubeDL
```
This block breaks lazy importing of `cv2`, which `vllm` strictly enforces. It happens when vLLM imports `from transformers.image_utils import ImageInput`. vLLM cannot upgrade to v4.49.0 because of it (vllm-project/vllm#13905). Would it be possible to delay this import? This would be preferable to lazily importing `ImageInput` everywhere it's used in vLLM.
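One possible way to address this, sketched under the assumption that `cv2` is only needed inside the OpenCV reader (the function name here is illustrative):

```python
from typing import List, Optional


def read_video_opencv(video_path: str, num_frames: Optional[int] = None) -> List:
    """Decode frames with OpenCV, importing cv2 only when this backend is actually used."""
    # Local import: `from transformers.image_utils import ImageInput` no longer
    # pulls in OpenCV at module import time, which keeps vLLM's lazy-import check happy.
    import cv2

    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames if num_frames is None else frames[:num_frames]
```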
What does this PR do?
Part of #33948. This PR adds support for `return_tensors="pt"` when calling chat templates for processors. That way users can obtain inputs in tensor format and pass them directly to the model, instead of having to call the processor with a formatted prompt + visuals.

For images we use the existing `load_images` functionality, and for videos I added a few functions. We usually use `av` in all video-related model docs since `decord` had problems with CUDA in the past. Apart from that we can use `opencv` or `torchvision` for video loading. I did a small benchmark run to load and uniformly sample 32 frames from around ~100 videos, and `av` was the slowest of them all, while `decord` was the fastest. Therefore I decided to add a helper with all possible backends and let users switch whenever they want to. By default we use `opencv` as it is a more common CV framework than the others provided here.

In the future we might start using `torchvision` when we add the `VideoProcessor` class and support `VideoProcessorFast` (see #33504).

These are the results of a small benchmark with ~100 videos:
Review from @Rocketknight1 for templates and @qubvel for general CV-related modifications.
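To illustrate the backend-switchable design described above, here is a rough sketch; the reader functions, registry, and `load_video` signature are assumptions for illustration, not the PR's actual implementation:

```python
from typing import Callable, Dict

# Hypothetical per-backend readers; each would return the decoded (and sampled) frames.
def _read_with_opencv(path: str, num_frames: int): ...
def _read_with_decord(path: str, num_frames: int): ...
def _read_with_av(path: str, num_frames: int): ...
def _read_with_torchvision(path: str, num_frames: int): ...

VIDEO_READERS: Dict[str, Callable] = {
    "opencv": _read_with_opencv,        # default: the most commonly installed CV library
    "decord": _read_with_decord,        # fastest in the benchmark above
    "av": _read_with_av,                # used in most video model docs, but slowest here
    "torchvision": _read_with_torchvision,
}


def load_video(path: str, num_frames: int = 32, backend: str = "opencv"):
    """Dispatch video decoding to the requested backend, defaulting to opencv."""
    if backend not in VIDEO_READERS:
        raise ValueError(f"Unknown video backend {backend!r}; choose from {sorted(VIDEO_READERS)}.")
    return VIDEO_READERS[backend](path, num_frames)
```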