
Add ViTPose #30530

Merged: 201 commits merged into huggingface:main on Jan 8, 2025
Conversation

NielsRogge (Contributor) commented on Apr 28, 2024

What does this PR do?

This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.

Here's a demo notebook - note that the API might change. A rough usage sketch is also included below.

To do:

  • get rid of cv2 dependency (?)
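
For reference, a minimal usage sketch under the API proposed in this PR. The pose-estimation model class name and the checkpoint id are assumptions (the released checkpoints live under https://huggingface.co/usyd-community), and the exact signatures may still change:

import torch
from PIL import Image
from transformers import VitPoseImageProcessor, VitPoseForPoseEstimation

checkpoint = "usyd-community/vitpose-base-simple"  # assumed checkpoint id
image_processor = VitPoseImageProcessor.from_pretrained(checkpoint)
model = VitPoseForPoseEstimation.from_pretrained(checkpoint)

image = Image.open("person.jpg")  # placeholder path to an image containing a person
# one list of [x, y, w, h] person boxes per image, e.g. obtained from an object detector
boxes = [[[412.8, 157.61, 53.05, 138.01]]]

inputs = image_processor(images=image, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keypoints and scores per detected person
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)[0]
print(pose_results[0]["keypoints"], pose_results[0]["scores"])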

@qubvel qubvel requested a review from ArthurZucker November 22, 2024 17:39
SangbumChoi (Contributor) commented:

@ArthurZucker Reminder ping :)

ArthurZucker (Collaborator) left a comment:

Are you certain we need to split these? I think we can just add everything in vitpose, no? Why do we need to have the backbone separate?
Otherwise, good work!

src/transformers/models/vitpose/__init__.py (review thread resolved)
Comment on lines 229 to 309 (conversion script):
# create image processor
image_processor = VitPoseImageProcessor()

# verify image processor
image = prepare_img()
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
pixel_values = image_processor(images=image, boxes=boxes, return_tensors="pt").pixel_values

filepath = hf_hub_download(repo_id="nielsr/test-image", filename="vitpose_batch_data.pt", repo_type="dataset")
original_pixel_values = torch.load(filepath, map_location="cpu")["img"]
assert torch.allclose(pixel_values, original_pixel_values, atol=1e-1)

dataset_index = torch.tensor([0])

with torch.no_grad():
    # first forward pass
    outputs = model(pixel_values, dataset_index=dataset_index)
    output_heatmap = outputs.heatmaps

    # second forward pass (flipped)
    # this is done since the model uses `flip_test=True` in its test config
    pixel_values_flipped = torch.flip(pixel_values, [3])
    outputs_flipped = model(
        pixel_values_flipped,
        dataset_index=dataset_index,
        flip_pairs=torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]),
    )
    output_flipped_heatmap = outputs_flipped.heatmaps

outputs.heatmaps = (output_heatmap + output_flipped_heatmap) * 0.5

# Verify pose_results
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)[0]

if model_name == "vitpose-base-simple":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98180511e02, 1.81808380e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.66642594e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.9807913e02, 1.8182812e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.8235235e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base-coco-aic-mpii":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98305542e02, 1.81741592e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.69966745e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-plus-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98201294e02, 1.81728302e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.75046968e-01]),
        atol=5e-2,
    )
else:
    raise ValueError("Model not supported")
print("Conversion successfully done.")
Collaborator: As always, let's put this in tests rather than here; it's not part of the conversion.

Contributor: Do you mean erase it from this convert.py and put it in the test, or keep it both here and in the test?

Member: Yes, let's erase it here and leave it only in the tests.

NielsRogge (Contributor, Author) commented on Dec 16, 2024:

I'm not a fan of this; I'd always put these checks in conversion scripts, especially since we are still tweaking the architecture and things can break, and the number of users running the ViTPose conversion script afterwards is very close to zero.

Another reason is that the conversion script then serves as a single source of truth for people who want to know how we obtained equivalent logits.

By only adding an integration test, you lose all that information, which causes breaking changes such as this one. For models like Pixtral but also SAM, it's unclear to me how equivalent logits were obtained, and I'm not even sure we have them, as it's not verified in the conversion script!

Contributor: I also agree with Niels, but let me know which way you'd like it refactored.

Member: That sounds fair to me, thanks for explaining it. I would also add that we typically don't have all checkpoints covered by integration tests, and having this kind of check in the conversion script also lets us iterate faster while porting the model, since we don't have to run the tests all the time.

> the number of users using the ViTPose conversion script afterwards is a number very close to 0.

Yeah, but these things tend to propagate to newer models too. So maybe it's worth adding a flag that toggles the logits check, to make it easier to port other checkpoints as well. What do you think?

NielsRogge (Contributor, Author): Yes, adding a boolean flag is fine with me.

Collaborator: Okay, sounds good, let's add a flag in that case; it's the best of both worlds 😉
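
As a rough sketch of what such a flag could look like in the conversion script (the --verify_logits flag name, the function signature, and the argument parser below are assumptions, not the final implementation):

import argparse

def convert_vitpose_checkpoint(model_name, pytorch_dump_folder_path, verify_logits=False, push_to_hub=False):
    # ... load the original weights and remap them into the Hugging Face model ...
    if verify_logits:
        # run the forward passes and torch.allclose checks shown above only when
        # explicitly requested, so new checkpoints can be ported without
        # hard-coded reference values
        ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="vitpose-base-simple", type=str)
    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str)
    parser.add_argument("--verify_logits", action="store_true", help="Whether to verify outputs against reference values.")
    parser.add_argument("--push_to_hub", action="store_true")
    args = parser.parse_args()
    convert_vitpose_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.verify_logits, args.push_to_hub)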

Comment on lines 511 to 545
if self.do_affine_transform:
    new_images = []
    for image, image_boxes in zip(images, boxes):
        for box in image_boxes:
            center, scale = box_to_center_and_scale(
                box,
                image_width=size["width"],
                image_height=size["height"],
                normalize_factor=self.normalize_factor,
            )
            transformed_image = self.affine_transform(
                image, center, scale, rotation=0, size=size, input_data_format=input_data_format
            )
            new_images.append(transformed_image)
    images = new_images

# For batch processing, the number of boxes must be consistent across all images in the batch.
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

images = [
    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]

Collaborator: Why don't we use a single for loop here?

qubvel (Member) commented on Dec 16, 2024:

For clarity, transform:

        if self.do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
                for image in images
            ]
        if self.do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
                for image in images
            ]

into something like:

preprocessed_images = []
for image in images:
    ...
    if self.do_rescale:
        image = self.rescale(image, ...)
    if self.do_normalize:
        image = self.normalize(image, ...)
    preprocessed_images.append(image)
...
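
Spelled out with the same helper calls as the block under review, the single-loop version might look like this (a sketch, not the merged code):

preprocessed_images = []
for image in images:
    if self.do_rescale:
        image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
    if self.do_normalize:
        image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
    # convert to the requested channel dimension format in the same pass
    image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
    preprocessed_images.append(image)
images = preprocessed_images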

NielsRogge (Contributor, Author): That would make it inconsistent with the other image processors we have, e.g. ViTImageProcessor.

Member: Some models have had this order refactored already (clip, bit, chameleon, llava*, ...). It's not a big change, but I believe it improves the readability of the code.

Collaborator: It is also supposed to be faster 😉

NielsRogge (Contributor, Author): OK, let's refactor the other ones in that case :)

src/transformers/models/vitpose_backbone/__init__.py (outdated, resolved)
        experts = [nn.Linear(hidden_features, part_features) for _ in range(num_experts)]
        self.experts = nn.ModuleList(experts)

    def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
Collaborator: Let's go with the SwitchTransformersSparseMLP implementation, it should be more efficient, see #31173.

Collaborator: It's not a must, but it's good for both standardization and performance in general!

Contributor: @ArthurZucker I have read the PR you linked and understood the concept of reducing the number of unused experts that run at inference time.

This case is a little different, since we have an additional input that determines which experts are used. However, I have changed the code slightly; see the sketch below.
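
As a rough illustration of that kind of per-sample expert routing (a sketch only; the class name, the shared/expert split, and the shapes are assumptions, not the code in this PR):

import torch
from torch import nn

class MoeMLPSketch(nn.Module):
    """Sketch of an MLP whose last projection is partly split across dataset-specific experts."""

    def __init__(self, hidden_features: int, part_features: int, num_experts: int):
        super().__init__()
        self.part_features = part_features
        # shared part of the projection plus one small expert head per dataset,
        # mirroring the `experts = [nn.Linear(...)]` excerpt above
        self.fc_shared = nn.Linear(hidden_features, hidden_features - part_features)
        self.experts = nn.ModuleList(nn.Linear(hidden_features, part_features) for _ in range(num_experts))

    def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        # `indices` holds one expert id per sample (e.g. the dataset_index of each image)
        shared_output = self.fc_shared(hidden_state)
        expert_output = hidden_state.new_zeros(*hidden_state.shape[:-1], self.part_features)
        for expert_idx, expert in enumerate(self.experts):
            mask = indices == expert_idx
            if mask.any():
                # run each expert only on the samples routed to it
                expert_output[mask] = expert(hidden_state[mask])
        return torch.cat([shared_output, expert_output], dim=-1)

With this layout, only the experts that actually appear in `indices` run a forward pass.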

        hidden_states = attention_output + hidden_states

        layer_output = self.layernorm_after(hidden_states)
        if self.num_experts == 1:
Collaborator: Are there any released models that use MoE?

Contributor: Yes, they have released several models that contain MoE parameters.

Collaborator: Looks alright, but we should be using modular transformers. That will make things easier when we refactor attention or whatnot!

Collaborator: Also, you know the other vision models better than I do, but let's push standardization as much as we can for things that can be mapped onto an existing implementation!

SangbumChoi (Contributor) commented on Dec 14, 2024:

@ArthurZucker Thanks for the review; I will address the comments marked with a thumbs-up reaction. Regarding the backbone question, I don't think there is a strictly necessary reason for splitting it into two independent architectures (Niels did this work first and designed this part).

However, I assume it reflects the intention of the paper's authors and of Niels. The paper targets human pose estimation built on a very strong, shared backbone, switching only the last MLP-like (i.e. head) layer across tasks and datasets for architectural efficiency. For instance, you can see that many models share the same backbone architecture. I think that is why Niels built it this way. (@NielsRogge, is there a specific reason for splitting out the backbone architecture?)

https://github.com/facebookresearch/mae
https://github.com/ViTAE-Transformer/ViTPose?tab=readme-ov-file

NielsRogge (Contributor, Author) commented:

It's mainly because ViTPose itself is a framework that can leverage various backbones, so I made it compatible with the AutoBackbone API (similar to frameworks like DETR, DPT and Mask2Former). See here for example usage: https://huggingface.co/docs/transformers/en/model_doc/dpt#usage-tips
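
By analogy with the DPT usage linked above, composing ViTPose with a separately configured backbone might look roughly like this (class names are assumed from the vitpose and vitpose_backbone modules added in this PR and may differ from the final API):

from transformers import VitPoseBackboneConfig, VitPoseConfig, VitPoseForPoseEstimation

# configure the backbone separately, then plug it into the pose estimation framework,
# mirroring the AutoBackbone pattern used by DETR, DPT and Mask2Former
backbone_config = VitPoseBackboneConfig()
config = VitPoseConfig(backbone_config=backbone_config)
model = VitPoseForPoseEstimation(config)

Keeping the backbone as its own model class is what makes this kind of composition, and swapping in other backbones later, possible.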

ArthurZucker (Collaborator) left a comment:

Thanks for iterating! Just two things to check, one blocking and the other not.
In general, @qubvel and @NielsRogge, I think we have an effort to make to bring standardization to vision models, and both of you know a lot more about that than I do!

-> Things like patch embeddings, classic decoder layers, etc. that are re-implemented everywhere could benefit from that. I might be wrong though! 😉

Anyway, great work everyone! @qubvel feel free to merge once these are addressed and you are happy with the changes!

# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
Collaborator: Let's update this!

Comment on lines +247 to +253
# add backbone attributes
if not hasattr(self.backbone.config, "hidden_size"):
    raise ValueError("The backbone should have a hidden_size attribute")
if not hasattr(self.backbone.config, "image_size"):
    raise ValueError("The backbone should have an image_size attribute")
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")
Collaborator: Config checks should be done in the config; I'm not sure this has a lot of value / it can be removed!
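
For illustration, such a check could live in the configuration class instead; a simplified sketch, with the attribute list taken from the excerpt above and the class name and signature assumed rather than taken from this PR:

from transformers import PretrainedConfig

class VitPoseConfigSketch(PretrainedConfig):
    def __init__(self, backbone_config=None, **kwargs):
        super().__init__(**kwargs)
        # validate the backbone configuration once, at config creation time,
        # instead of inside the model's __init__
        if backbone_config is not None:
            for attribute in ("hidden_size", "image_size", "patch_size"):
                if not hasattr(backbone_config, attribute):
                    raise ValueError(f"The backbone config should have a `{attribute}` attribute")
        self.backbone_config = backbone_config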

if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")

self.head = VitPoseSimpleDecoder(config) if config.use_simple_decoder else VitPoseClassicDecoder(config)
Collaborator: Are both decoders used in the released checkpoints?

Contributor: Yes, they are!


SangbumChoi (Contributor) commented:

@qubvel @NielsRogge Hi, can you check the changes and also the checkpoints? Thanks as always for the support.

@qubvel qubvel self-requested a review January 8, 2025 11:09
qubvel (Member) commented on Jan 8, 2025:

Hi @SangbumChoi! Thanks for the fixes, I will push the final fixes and merge the PR!

Checkpoints can be found here:
https://huggingface.co/usyd-community

@qubvel qubvel merged commit 8490d31 into huggingface:main Jan 8, 2025
24 checks passed
        initializer_range: float = 0.02,
        scale_factor: int = 4,
        use_simple_decoder: bool = True,
        **kwargs,
Member: `num_labels` is a default argument of PretrainedConfig.
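
In other words, it does not need to be declared in the ViTPose config signature, since PretrainedConfig already picks it up from **kwargs; a small sketch to illustrate (assuming standard PretrainedConfig behavior, not ViTPose-specific code):

from transformers import PretrainedConfig

# `num_labels` passed through **kwargs is handled by PretrainedConfig itself
config = PretrainedConfig(num_labels=17)
print(config.num_labels)  # 17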
