Add Video Llava #29733

Merged
merged 57 commits into from
May 15, 2024
dce6678
add model draft
zucchini-nlp Mar 19, 2024
72626df
update docstring
zucchini-nlp Mar 20, 2024
8cca731
add tests
zucchini-nlp Mar 20, 2024
4ea4f70
support image and video as input
zucchini-nlp Mar 20, 2024
c36819d
update for better handling of mixed input and clean-up a bit
zucchini-nlp Mar 21, 2024
c1a8fd5
bug when mixed inputs & add tests
zucchini-nlp Apr 8, 2024
c591c75
Update README.md
zucchini-nlp Apr 8, 2024
5ff8d18
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp Apr 8, 2024
a6bc68d
link to abstract of paper in README
zucchini-nlp Apr 8, 2024
eb309ed
fix test
zucchini-nlp Apr 8, 2024
2f46f6c
fix-copies
zucchini-nlp Apr 8, 2024
6b51b7e
Merge branch 'main' into video_llava
zucchini-nlp Apr 8, 2024
e112958
make tests happy
zucchini-nlp Apr 8, 2024
5cb6163
skip docstest for now
zucchini-nlp Apr 10, 2024
930147d
do not run doctest for now
zucchini-nlp Apr 18, 2024
24ec2b3
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp Apr 18, 2024
142bfc0
Update src/transformers/models/video_llava/processing_video_llava.py
zucchini-nlp Apr 22, 2024
fdec895
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
e83251c
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
4fcfe72
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
327030d
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
33289a5
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 22, 2024
dfef75a
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
ebf1042
address review comments
zucchini-nlp Apr 22, 2024
aa1b278
failing tests
zucchini-nlp Apr 22, 2024
7802922
Fix vocab_size in common tests for VLMs
zucchini-nlp Apr 23, 2024
9fce414
codestyle
zucchini-nlp Apr 23, 2024
e8b4569
Merge branch 'huggingface:main' into video_llava
zucchini-nlp Apr 23, 2024
bb1cc26
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp Apr 29, 2024
e2e92b2
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp Apr 29, 2024
5c77fff
Update src/transformers/models/video_llava/modeling_video_llava.py
zucchini-nlp Apr 29, 2024
99518cb
Update src/transformers/models/video_llava/modeling_video_llava.py
zucchini-nlp Apr 29, 2024
451fd72
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
95a9a01
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
347fa8c
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 30, 2024
3e2f1b4
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
3cd1222
Update src/transformers/models/video_llava/processing_video_llava.py
zucchini-nlp Apr 30, 2024
242703a
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
9c1a10d
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
b4145e1
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
5803d5a
PR suggestions
zucchini-nlp Apr 30, 2024
975d959
fix-copies
zucchini-nlp Apr 30, 2024
7f30e3b
Merge branch 'main' into video_llava
zucchini-nlp Apr 30, 2024
6bdad81
Merge branch 'huggingface:main' into video_llava
zucchini-nlp May 1, 2024
a817f31
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp May 8, 2024
dba80e2
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp May 8, 2024
6b3eafb
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp May 8, 2024
ba4e125
add full example in docs
zucchini-nlp May 8, 2024
6cc8af1
clean-up with new model-id
zucchini-nlp May 10, 2024
885a5ae
[run-slow] video_llava
zucchini-nlp May 10, 2024
377aafe
update docstring
zucchini-nlp May 10, 2024
637b197
Merge branch 'main' into video_llava
zucchini-nlp May 10, 2024
a411347
[run-slow] video_llava
zucchini-nlp May 10, 2024
0d83eaf
Merge branch 'huggingface:main' into video_llava
zucchini-nlp May 14, 2024
8134039
remove all achive maps
zucchini-nlp May 15, 2024
8e15514
fix some tests
zucchini-nlp May 15, 2024
5d1e976
test was supposed to be skipped for llava :)
zucchini-nlp May 15, 2024
20 changes: 12 additions & 8 deletions src/transformers/models/video_llava/configuration_video_llava.py
@@ -38,9 +38,10 @@ class VideoLlavaConfig(PretrainedConfig):

Args:
vision_config (`VideoLlavaVisionConfig`, *optional*):
Custom vision config or dict
Custom vision config or dict. Defaults to `CLIPVisionConfig` if not indicated.
text_config (`Union[AutoConfig, dict]`, *optional*):
The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
Defaults to `LlamaConfig` if not indicated.
ignore_index (`int`, *optional*, defaults to -100):
The ignore index for the loss function.
image_token_index (`int`, *optional*, defaults to 32000):
@@ -101,7 +102,9 @@ def __init__(
self.vision_config = vision_config

if isinstance(self.vision_config, dict):
vision_config["model_type"] = vision_config.get("model_type", "clip_vision_model")
if "model_type" not in vision_config:
vision_config["model_type"] = "clip_vision_model"
logger.warning("Key=`model_type` not found in vision config, setting it to `clip_vision_model`")
self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
elif vision_config is None:
self.vision_config = CONFIG_MAPPING["clip_vision_model"](
@@ -115,12 +118,13 @@ def __init__(
projection_dim=768,
)

self.text_config = text_config

if isinstance(self.text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
if isinstance(text_config, dict):
if "model_type" not in text_config:
text_config["model_type"] = "llama"
logger.warning("Key=`model_type` not found in text config, setting it to `llama`")
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
self.text_config = CONFIG_MAPPING["llama"]()
text_config = CONFIG_MAPPING["llama"]()

self.text_config = text_config
super().__init__(**kwargs)
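
The resolution logic in this hunk can be sketched in isolation. This is a minimal illustration, not the merged code: `resolve_sub_config` is a hypothetical helper, and the `CONFIG_MAPPING` below is a stand-in dictionary of factories rather than the real mapping from `transformers`.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the library's CONFIG_MAPPING: model_type -> config factory.
CONFIG_MAPPING = {
    "llama": lambda **kw: {"kind": "LlamaConfig", **kw},
    "mistral": lambda **kw: {"kind": "MistralConfig", **kw},
}


def resolve_sub_config(config, default_model_type):
    """Turn a dict (or None) into a config object, defaulting the model type."""
    if isinstance(config, dict):
        if "model_type" not in config:
            # Mirrors the diff: fill in the default and warn instead of doing it silently.
            config["model_type"] = default_model_type
            logger.warning(
                "Key=`model_type` not found in config, setting it to `%s`", default_model_type
            )
        return CONFIG_MAPPING[config["model_type"]](**config)
    if config is None:
        return CONFIG_MAPPING[default_model_type]()
    # Already a config object: pass it through unchanged.
    return config


text_config = resolve_sub_config({"hidden_size": 64}, "llama")
```

As in the diff, the result is assigned only once at the end, so a caller never sees a half-resolved dict stored on the config.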
Collaborator

We should add tests for the image processor - in particular to test that it correctly handles just images, just videos and image + video inputs

Member Author

Added tests, but there is one thing to note. If we call the ImageProcessor class directly, it requires the argument `images` to be present. A workaround is to explicitly pass `images=None` to VideoLlavaImageProcessor, which I did for the tests.

I can override `__call__` to make the argument `images=None`, so that it is optional, but I am not sure how good overriding `__call__` is. Also, I do not think many people call the image processor explicitly.

Collaborator

If the image processor takes both images and videos as input, and only one of them is required, then setting image = None seems reasonable
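
A rough sketch of what "both optional, one required" could look like follows. The function below is a hypothetical stand-in, not the `VideoLlavaImageProcessor.preprocess` from this PR; the output key `pixel_values_images` and the explicit error for the "neither given" case are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def preprocess(
    images: Optional[List[Any]] = None,
    videos: Optional[List[Any]] = None,
) -> Dict[str, List[Any]]:
    """Hypothetical stand-in: collect whichever modality was provided."""
    # Assumption: rejecting the "neither given" case is this sketch's choice,
    # not confirmed behavior of the merged code.
    if images is None and videos is None:
        raise ValueError("Provide at least one of `images` or `videos`.")

    data: Dict[str, List[Any]] = {}
    if videos is not None:
        data["pixel_values_videos"] = list(videos)
    if images is not None:
        # Key name assumed by analogy with `pixel_values_videos`.
        data["pixel_values_images"] = list(images)
    return data
```

With defaults of `None` on both arguments, callers can pass only `videos=...` without having to spell out `images=None`.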

@@ -139,6 +139,7 @@ def __init__(
self.do_convert_rgb = do_convert_rgb
self._valid_processor_keys = [
"images",
"videos",
"do_resize",
"size",
"resample",
@@ -206,8 +207,8 @@ def resize(

def preprocess(
self,
images: List[ImageInput],
videos: List[VideoInput],
images: List[ImageInput] = None,
videos: List[VideoInput] = None,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
@@ -228,9 +229,12 @@ def preprocess(
Preprocess an image or batch of images.

Args:
visual_inputs (`ImageInput`):
List of images and/or videos to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
images (`ImageInput`, *optional*):
List of images to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
videos (`VideoInput`, *optional*):
List of videos to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -326,7 +330,7 @@ def preprocess(
]
for video in videos
]
data["pixel_values_video"] = pixel_values_video
data["pixel_values_videos"] = pixel_values_videos

if images is not None:
pixel_values_images = [
8 changes: 8 additions & 0 deletions src/transformers/models/video_llava/modeling_video_llava.py
@@ -217,6 +217,11 @@ def _supports_sdpa(self):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
vision_feature_layer (`int`, *optional*, defaults to -2):
The index of the layer to select the vision feature.
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
The feature selection strategy used to select the vision feature from the vision backbone.
Can be one of `"default"` or `"full"`
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
@@ -372,6 +377,9 @@ def _get_vision_features(
# videos do not need to select features and it's always "full" (as it is done in the orig implementation)
if pixel_values_videos is not None:
batch_size_vid, num_frames, channels, height, width = pixel_values_videos.shape
if num_frames != 8:
raise ValueError(f"Video pixel values should have exactly `8` frames but found `{num_frames}`")

pixel_values = pixel_values_videos.reshape(batch_size_vid * num_frames, channels, height, width)
video_outputs = self.video_tower(pixel_values, output_hidden_states=True)
video_outputs = video_outputs.hidden_states[vision_feature_layer].squeeze(1)
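
The frame check and the reshape in this hunk come down to simple shape arithmetic. The sketch below works on shape tuples only, so it needs no tensor library; `flatten_video_shape` is a hypothetical helper name, and only the 8-frame requirement and the `(batch * frames, channels, height, width)` layout are taken from the diff.

```python
from typing import Tuple

EXPECTED_NUM_FRAMES = 8  # value taken from the check added in this hunk


def flatten_video_shape(shape: Tuple[int, int, int, int, int]) -> Tuple[int, int, int, int]:
    """Return the shape the video tower would receive after frames are folded into the batch."""
    batch_size, num_frames, channels, height, width = shape
    if num_frames != EXPECTED_NUM_FRAMES:
        raise ValueError(
            f"Video pixel values should have exactly `{EXPECTED_NUM_FRAMES}` frames "
            f"but found `{num_frames}`"
        )
    # Each frame becomes its own batch entry, so the vision tower sees plain images.
    return (batch_size * num_frames, channels, height, width)
```

For example, two clips of 8 frames at 224x224 would reach the video tower as a batch of 16 images.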
9 changes: 4 additions & 5 deletions src/transformers/models/video_llava/processing_video_llava.py
@@ -112,11 +112,10 @@ def __call__(
encoded_images = self.image_processor(images=images, videos=videos, return_tensors=return_tensors)
data.update(encoded_images)

if text is not None:
text_inputs = self.tokenizer(
text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
)
data.update(text_inputs)
text_inputs = self.tokenizer(
text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
)
data.update(text_inputs)

return BatchFeature(data=data)
