Add ViTPose #30530
Conversation
@ArthurZucker please review whenever you have time! The notebook to run: https://colab.research.google.com/drive/1e8fcby5rhKZWcr9LSN8mNbQ0TU4Dxxpo?usp=sharing Checkpoints:
@ArthurZucker Reminder ping :)
Wondering if you are certain we need to split these. I think we can just add it all in vitpose, no? Why do we need to have the backbone separate?
Otherwise good work!
# create image processor
image_processor = VitPoseImageProcessor()

# verify image processor
image = prepare_img()
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
pixel_values = image_processor(images=image, boxes=boxes, return_tensors="pt").pixel_values

filepath = hf_hub_download(repo_id="nielsr/test-image", filename="vitpose_batch_data.pt", repo_type="dataset")
original_pixel_values = torch.load(filepath, map_location="cpu")["img"]
assert torch.allclose(pixel_values, original_pixel_values, atol=1e-1)

dataset_index = torch.tensor([0])

with torch.no_grad():
    # first forward pass
    outputs = model(pixel_values, dataset_index=dataset_index)
    output_heatmap = outputs.heatmaps

    # second forward pass (flipped)
    # this is done since the model uses `flip_test=True` in its test config
    pixel_values_flipped = torch.flip(pixel_values, [3])
    outputs_flipped = model(
        pixel_values_flipped,
        dataset_index=dataset_index,
        flip_pairs=torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]),
    )
    output_flipped_heatmap = outputs_flipped.heatmaps

outputs.heatmaps = (output_heatmap + output_flipped_heatmap) * 0.5

# Verify pose_results
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)[0]

if model_name == "vitpose-base-simple":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98180511e02, 1.81808380e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.66642594e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.9807913e02, 1.8182812e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.8235235e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base-coco-aic-mpii":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98305542e02, 1.81741592e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.69966745e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-plus-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98201294e02, 1.81728302e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.75046968e-01]),
        atol=5e-2,
    )
else:
    raise ValueError("Model not supported")
print("Conversion successfully done.")
As always, let's put this in tests rather than here. It's not part of the conversion.
Do you mean erase it in this convert.py and put it in the test? Or keep it both here and in the test?
Yes, let's erase it here and leave it only in tests
I'm not a fan of this, I'd always put these in conversion scripts.... especially since we are still tweaking the architecture and things can break + the number of users using the ViTPose conversion script afterwards is a number very close to 0.
another reason is that you have a single source of truth to showcase to people who want to know how we obtained equivalent logits.
By just adding an integration test, you lose all that information, causing breaking changes such as this one. For models like Pixtral but also SAM, it's unclear to me how equivalent logits were obtained, and I'm not even sure we have them as it's not proven in the conversion script!
I also agree with Niels, but let me know which way you'd like me to refactor.
That sounds fair to me, thanks for explaining it. I would also add that we typically don't have all checkpoints tested in integration tests, and having this kind of test in the conversion script also allows to iterate faster while porting the model, as we don't have to run tests all the time.
the number of users using the ViTPose conversion script afterwards is a number very close to 0.
Yeah, but things tend to propagate to newer models too. So maybe it's worth adding a flag to check logits or not, to allow porting other checkpoints as well. What do you think?
Yes adding a boolean flag is fine for me
Okay, sounds good, let's add a flag in that case, it's the best of both worlds 😉
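A minimal sketch of what such a flag could look like in the conversion script (the flag name and wiring are assumptions, not the final API):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="vitpose-base-simple", type=str, help="Name of the checkpoint to convert.")
# Hypothetical flag: only verify outputs for checkpoints that have reference values.
parser.add_argument(
    "--check_logits",
    action="store_true",
    help="Compare the converted model's outputs against reference values.",
)
args = parser.parse_args()

if args.check_logits:
    # ... run the verification block shown above (forward passes + allclose asserts) ...
    pass

With such a flag, new checkpoints can be ported by simply omitting --check_logits, while the known checkpoints keep the logit verification as a single source of truth.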
if self.do_affine_transform:
    new_images = []
    for image, image_boxes in zip(images, boxes):
        for box in image_boxes:
            center, scale = box_to_center_and_scale(
                box,
                image_width=size["width"],
                image_height=size["height"],
                normalize_factor=self.normalize_factor,
            )
            transformed_image = self.affine_transform(
                image, center, scale, rotation=0, size=size, input_data_format=input_data_format
            )
            new_images.append(transformed_image)
    images = new_images

# For batch processing, the number of boxes must be consistent across all images in the batch.
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

images = [
    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
why don't we have a single for loop here?
For clarity, transform:

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

into something like:

preprocessed_images = []
for image in images:
    ...
    if self.do_rescale:
        image = self.rescale(image, ...)
    if self.do_normalize:
        image = self.normalize(image, ...)
    preprocessed_images.append(image)
...
That would make it inconsistent with the other image processors we have, e.g. here for ViTImageProcessor.
Some models have had this order refactored already (clip, bit, chameleon, llava*, ...). It's not a big change, but I believe it improves the readability of the code.
It is also supposed to be faster 😉
Ok let's refactor the other ones in that case :)
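For reference, a minimal sketch of the single-loop variant discussed above, keeping the original rescale-then-normalize order; it reuses the method's existing attributes and helpers, so treat it as illustrative rather than the final implementation:

preprocessed_images = []
for image in images:
    if self.do_rescale:
        image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
    if self.do_normalize:
        image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
    # the channel-dimension conversion can be folded into the same loop
    image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
    preprocessed_images.append(image)
images = preprocessed_images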
experts = [nn.Linear(hidden_features, part_features) for _ in range(num_experts)]
self.experts = nn.ModuleList(experts)

def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
let's go with the SwitchTransformersSparseMLP
implementation, it should be more efficient, see #31173
it's not a must but it's good for both standardization and performances in general!
@ArthurZucker I have read the PR you linked and understood the idea of not computing experts that go unused at inference time.
This case is a little different since we have an additional input that determines which experts are used. However, I have changed the code slightly.
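For context, here is a minimal, self-contained sketch of running only the experts that are actually selected when the routing comes from an external per-sample `indices` tensor (e.g. a dataset index) rather than a learned router; the class name, shapes, and usage below are illustrative, not the PR's implementation:

import torch
import torch.nn as nn


class PartMoESketch(nn.Module):
    def __init__(self, hidden_features: int, part_features: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden_features, part_features) for _ in range(num_experts))

    def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch_size, seq_len, hidden_features); indices: (batch_size,) expert id per sample
        batch_size, seq_len, _ = hidden_state.shape
        output = hidden_state.new_zeros(batch_size, seq_len, self.experts[0].out_features)
        # only loop over experts that at least one sample in the batch actually uses
        for expert_idx in indices.unique():
            sample_mask = indices == expert_idx
            output[sample_mask] = self.experts[int(expert_idx)](hidden_state[sample_mask])
        return output


# usage sketch
moe = PartMoESketch(hidden_features=768, part_features=192, num_experts=6)
out = moe(torch.randn(4, 196, 768), indices=torch.tensor([0, 0, 3, 3]))
print(out.shape)  # torch.Size([4, 196, 192])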
hidden_states = attention_output + hidden_states

layer_output = self.layernorm_after(hidden_states)
if self.num_experts == 1:
are there any released models that use MoE?
Yes, they have several models that contain MoE parameters.
Looks alright, but we should be using modular transformers. This will make it easier when we refactor attention or whatnot!
Also, you know the other vision models better than I do, but let's push standardization as much as we can for things that can be translated to an existing pattern!
@ArthurZucker Thanks for the review — I've acknowledged your comments with a thumbs-up reaction and will make the changes. Regarding the backbone question: I don't think there is a strictly necessary reason for splitting this into two independent architectures (Niels did this part originally and designed it that way), but I assume it reflects the intention of both the authors and Niels. The paper targets human pose estimation built on a very strong shared backbone, switching only the last MLP-like (i.e. head) layer for different tasks and datasets, for architectural efficiency. You can see that many models share the same backbone architecture, e.g. https://github.com/facebookresearch/mae. So I think that's why Niels built it this way. (@NielsRogge is there a specific reason for splitting out the backbone architecture?)
It's mainly because ViTPose itself is a framework which could leverage various different backbones, so I made it compatible with the
This reverts commit 2c56a48.
Thanks for iterating! Just 2 things to check, 1 blocking and the other one is not.
In general @qubvel and @NielsRogge I think we have an effort to make to bring standardization to vision models and both of you know a lot more than me about that!
-> Things like patch embedding, classic decoder layers, etc that are re-implemented everywhere could benefit from that. I might be wrong tho! 😉
Anyways, great work everyone! @qubvel feel free to merge once these are addressed and you are happy with the changes!
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
Let's update this!
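To make the quoted comment concrete, a small usage sketch (the dummy image and the exact default crop size are assumptions here):

import numpy as np
from transformers import VitPoseImageProcessor

image_processor = VitPoseImageProcessor()
image = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy image
# two person boxes (x, y, w, h) for this single image -> two crops
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]

inputs = image_processor(images=image, boxes=boxes, return_tensors="pt")
# pixel_values has shape (num_images * num_persons, num_channels, height, width),
# e.g. (2, 3, 256, 192) with the processor's default crop size
print(inputs.pixel_values.shape)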
# add backbone attributes
if not hasattr(self.backbone.config, "hidden_size"):
    raise ValueError("The backbone should have a hidden_size attribute")
if not hasattr(self.backbone.config, "image_size"):
    raise ValueError("The backbone should have an image_size attribute")
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")
config checks should be done in the config, not sure this has a lot of value / can be removed!
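A hedged sketch of what moving that check into the config could look like (the class name, constructor signature, and attribute names here are illustrative, not the actual VitPoseConfig API):

from transformers import PretrainedConfig


class VitPoseConfigSketch(PretrainedConfig):
    def __init__(self, backbone_config=None, **kwargs):
        super().__init__(**kwargs)
        # validate the backbone config once at config-creation time,
        # instead of inside the model's __init__
        if backbone_config is not None:
            for attr in ("hidden_size", "image_size", "patch_size"):
                if not hasattr(backbone_config, attr):
                    raise ValueError(f"The backbone config should have a `{attr}` attribute")
        self.backbone_config = backbone_config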
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")

self.head = VitPoseSimpleDecoder(config) if config.use_simple_decoder else VitPoseClassicDecoder(config)
are both used in the released checkpoints?
Yes, they are!
@qubvel @NielsRogge Hi, can you check the changes and also the checkpoints? Thanks as always for the support.
Hi @SangbumChoi! Thanks for the fixes, I will push the final fixes and merge the PR! Checkpoints can be found here:
initializer_range: float = 0.02,
scale_factor: int = 4,
use_simple_decoder: bool = True,
**kwargs,
num_labels is a default argument of PretrainedConfig
What does this PR do?
This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.
Here's a demo notebook - note that the API might change:
To do: