
Chat template: return vectorized output in processors #34275

Merged: 42 commits merged into huggingface:main on Jan 10, 2025

Conversation

@zucchini-nlp (Member)

What does this PR do?

Part of #33948. This PR adds support for `return_tensors="pt"` when calling chat templates on processors. That way users can obtain inputs in tensor format and pass them directly to the model, instead of having to call the processor with a formatted prompt plus visuals.
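As a quick illustration, here is a minimal usage sketch. The checkpoint and the exact content-dict keys are assumptions for illustration and may differ from the final API:

```python
from transformers import AutoProcessor

# Hypothetical checkpoint, chosen only for illustration
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},  # hypothetical URL
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# With tokenize=True and return_tensors="pt", the image is fetched and
# processed as well, so `inputs` can be passed straight to the model.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```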

For images we use the existing `load_images` functionality, and for videos I added a few new functions. We usually use `av` in all video-related model docs since `decord` had problems with CUDA in the past. Apart from those, we can use `opencv` or `torchvision` for video loading. I ran a small benchmark that loads and uniformly samples 32 frames from around ~100 videos: `decord` was the fastest, while `av` was among the slowest (only `torchvision` was slower). I therefore decided to add a helper that supports all of these backends and lets users switch whenever they want. By default we use `opencv`, as it is a more common CV framework than the other options provided here. A hedged usage sketch follows the benchmark numbers below.

In the future we might switch to torchvision when we add a VideoProcessor class and support VideoProcessorFast (see #33504).

These are the results of the benchmark over ~100 videos:

- decord: 475.2979 sec
- opencv: 614.6062 sec
- av: 1067.0860 sec
- torchvision: 1924.0433 sec
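For reference, a minimal sketch of the backend switch, assuming the `load_video` helper added in this PR (the exact signature and return type are hedged; they may differ by version):

```python
from transformers.image_utils import load_video  # helper location at the time of this PR

# Uniformly sample 32 frames; the backend defaults to "opencv" but can be
# switched explicitly, e.g. to "decord" for speed.
frames = load_video("path/to/video.mp4", num_frames=32, backend="decord")
```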

Requesting review from @Rocketknight1 for the templates and from @qubvel for the general CV-related modifications.

@Rocketknight1 (Member) left a comment:

This looks good to me! Just to clarify: the idea is that if you pass a chat to apply_chat_template where some of the content fields contain images or videos, and tokenize=True, then the images and videos are loaded and processed, so that the output is ready to pass to the model?

@zucchini-nlp (Member, Author) replied:

> the idea is that if you pass a chat to apply_chat_template and some of the content fields contain images or videos, and tokenize=True, then images and videos are loaded and processed, so that the output is ready to pass to the model?

Yep, exactly!
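To illustrate the video case as well, a hedged sketch (the content keys for videos are an assumption, not the confirmed schema):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "clip.mp4"},  # hypothetical local path
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Videos referenced in the chat are loaded and processed during tokenization.
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt"
)
```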

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel (Member) left a comment:

Thanks for working on this!

The main question that is not clear to me: what is the backend selection strategy? Should we pass it explicitly? I see `load_video` is used without passing the `backend` and `num_frames` arguments.

@zucchini-nlp (Member, Author) replied:

Welcome back @qubvel! Okay, I'll add more type hints and better docs. The backend should be selectable by the user, but we default to the one that works in all cases and has no weird CUDA-related failures. We should probably document this somewhere, but I haven't yet found a good place for it.

zucchini-nlp requested a review from qubvel on October 29, 2024 at 18:05
@qubvel (Member) left a comment:

Thanks, some nits!

Comment on lines +559 to +568
```python
if video.startswith("https://www.youtube.com") or video.startswith("http://www.youtube.com"):
    if not is_yt_dlp_available():
        raise ImportError("To load a video from YouTube url you have to install `yt_dlp` first.")
    buffer = BytesIO()
    with redirect_stdout(buffer), YoutubeDL() as f:
        f.download([video])
    bytes_obj = buffer.getvalue()
    file_obj = BytesIO(bytes_obj)
elif video.startswith("http://") or video.startswith("https://"):
    file_obj = BytesIO(requests.get(video).content)
```
A Member commented:

Some additional kwargs might be required here, e.g. timeout, but probably fine for now
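For example, a possible refinement along these lines (not part of the PR, just a sketch; the 10-second value is an arbitrary illustration):

```python
# Bound the download time so a stalled server cannot hang the processor.
file_obj = BytesIO(requests.get(video, timeout=10).content)
```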

zucchini-nlp and others added 6 commits October 30, 2024 10:01
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> (×6)
@zucchini-nlp (Member, Author):

Huh, I don't know why it requested review from so many people; feel free to unsubscribe, sorry!

@ArthurZucker (Collaborator) left a comment:

The API looks super good to me!
I am mostly wondering whether this poses a security threat, since we now open links ourselves, whereas before the user had to open the link explicitly in their own code.

@zucchini-nlp (Member, Author) replied:

Hmm, good point about security. We actually already have a few processors that open links for you, e.g. Idefics and Pixtral. I haven't seen anyone flag it as a security issue, so maybe it's not a big deal?

@stevhliu (Member) left a comment:

Thanks!

zucchini-nlp and others added 5 commits January 10, 2025 10:31
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> (×4)
zucchini-nlp merged commit e0646f3 into huggingface:main on Jan 10, 2025. 25 checks passed.
A Collaborator commented:

Let's remove it or put it in the benchmark file, but that's probably overkill!

ArthurZucker added a commit that referenced this pull request Jan 10, 2025
@qubvel (Member) commented on Jan 10, 2025:

Also, this file seems unrelated.

A Member commented:

And this one as well

Comment on lines +76 to +86
```python
if is_decord_available():
    from decord import VideoReader, cpu

if is_av_available():
    import av

if is_cv2_available():
    import cv2

if is_yt_dlp_available():
    from yt_dlp import YoutubeDL
```
@hmellor (Member) commented on Feb 26, 2025:

This block breaks lazy importing of cv2, which vLLM strictly enforces. It happens when vLLM runs `from transformers.image_utils import ImageInput`. vLLM cannot upgrade to v4.49.0 because of it (vllm-project/vllm#13905).

Would it be possible to delay this import? That would be preferable to lazily importing `ImageInput` everywhere it's used in vLLM.

cc @ArthurZucker
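For illustration, a minimal sketch of what a deferred import could look like (the function name and body are assumptions, not the actual transformers code):

```python
def read_video_opencv(video_path: str):
    # Deferred import: cv2 is only loaded when a video is actually read, so
    # `from transformers.image_utils import ImageInput` stays lightweight.
    import cv2

    cap = cv2.VideoCapture(video_path)
    frames = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```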
