Add ViTPose #30530
Conversation
@ArthurZucker please review whenever you have time! The notebook to run: https://colab.research.google.com/drive/1e8fcby5rhKZWcr9LSN8mNbQ0TU4Dxxpo?usp=sharing Checkpoints:
@ArthurZucker Reminder ping :)
Wondering if you are certain we need to split these. I think we can just add it all in vitpose, no? Why do we need to have the backbone separate?
Otherwise good work!
# create image processor
image_processor = VitPoseImageProcessor()

# verify image processor
image = prepare_img()
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
pixel_values = image_processor(images=image, boxes=boxes, return_tensors="pt").pixel_values

filepath = hf_hub_download(repo_id="nielsr/test-image", filename="vitpose_batch_data.pt", repo_type="dataset")
original_pixel_values = torch.load(filepath, map_location="cpu")["img"]
assert torch.allclose(pixel_values, original_pixel_values, atol=1e-1)

dataset_index = torch.tensor([0])

with torch.no_grad():
    # first forward pass
    outputs = model(pixel_values, dataset_index=dataset_index)
    output_heatmap = outputs.heatmaps

    # second forward pass (flipped)
    # this is done since the model uses `flip_test=True` in its test config
    pixel_values_flipped = torch.flip(pixel_values, [3])
    outputs_flipped = model(
        pixel_values_flipped,
        dataset_index=dataset_index,
        flip_pairs=torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]),
    )
    output_flipped_heatmap = outputs_flipped.heatmaps

outputs.heatmaps = (output_heatmap + output_flipped_heatmap) * 0.5

# Verify pose_results
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)[0]

if model_name == "vitpose-base-simple":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98180511e02, 1.81808380e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.66642594e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.9807913e02, 1.8182812e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.8235235e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base-coco-aic-mpii":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98305542e02, 1.81741592e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.69966745e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-plus-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98201294e02, 1.81728302e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.75046968e-01]),
        atol=5e-2,
    )
else:
    raise ValueError("Model not supported")
print("Conversion successfully done.")
As always, let's put this in tests rather than here. It's not part of the conversion.
Do you mean erase it in this convert.py and put it in the test? Or keep it both here and in the test?
Yes, let's erase it here and leave it only in tests
I'm not a fan of this, I'd always put these in conversion scripts.... especially since we are still tweaking the architecture and things can break + the number of users using the ViTPose conversion script afterwards is a number very close to 0.
another reason is that you have a single source of truth to showcase to people who want to know how we obtained equivalent logits.
By just adding an integration test, you lose all that information, causing breaking changes such as this one. For models like Pixtral but also SAM, it's unclear to me how equivalent logits were obtained, and I'm not even sure we have them as it's not proven in the conversion script!
I also agree with Niels, but let me know which way you'd like me to refactor.
That sounds fair to me, thanks for explaining it. I would also add that we typically don't have all checkpoints tested in integration tests, and having this kind of test in the conversion script also allows to iterate faster while porting the model, as we don't have to run tests all the time.
the number of users using the ViTPose conversion script afterwards is a number very close to 0.
Yeah, but things tend to propagate to newer models too. So maybe it's worth adding a flag to check logits or not, to allow porting other checkpoints as well. What do you think?
Yes adding a boolean flag is fine for me
Okay, sounds good, let's add a flag in that case, it's the best of both worlds 😉
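A minimal sketch of what such a flag could look like in the conversion script (the flag name and wiring are assumptions, not the final API):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="vitpose-base-simple", type=str, help="Name of the checkpoint to convert.")
# Hypothetical flag: only verify outputs for checkpoints that have reference values.
parser.add_argument(
    "--check_logits",
    action="store_true",
    help="Compare the converted model's outputs against reference values.",
)
args = parser.parse_args()

if args.check_logits:
    # ... run the verification block shown above (forward passes + allclose asserts) ...
    pass

With such a flag, new checkpoints can be ported by simply omitting --check_logits, while the known checkpoints keep the logit verification as a single source of truth.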
if self.do_affine_transform:
    new_images = []
    for image, image_boxes in zip(images, boxes):
        for box in image_boxes:
            center, scale = box_to_center_and_scale(
                box,
                image_width=size["width"],
                image_height=size["height"],
                normalize_factor=self.normalize_factor,
            )
            transformed_image = self.affine_transform(
                image, center, scale, rotation=0, size=size, input_data_format=input_data_format
            )
            new_images.append(transformed_image)
    images = new_images

# For batch processing, the number of boxes must be consistent across all images in the batch.
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

images = [
    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
why don't we have a single for loop here?
For clarity, transform:

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

into something like:

preprocessed_images = []
for image in images:
    ...
    if self.do_rescale:
        image = self.rescale(image, ...)
    if self.do_normalize:
        image = self.normalize(image, ...)
    preprocessed_images.append(image)
...
That would make it inconsistent with the other image processors we have, e.g. here for ViTImageProcessor.
Some models have had this order refactored already (clip, bit, chameleon, llava*, ...). It's not a big change, but I believe it improves the readability of the code.
It is also supposed to be faster 😉
Ok let's refactor the other ones in that case :)
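For reference, a minimal sketch of the single-loop variant discussed above, keeping the original rescale-then-normalize order; it reuses the method's existing attributes and helpers, so treat it as illustrative rather than the final implementation:

preprocessed_images = []
for image in images:
    if self.do_rescale:
        image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
    if self.do_normalize:
        image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
    # the channel-dimension conversion can be folded into the same loop
    image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
    preprocessed_images.append(image)
images = preprocessed_images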
experts = [nn.Linear(hidden_features, part_features) for _ in range(num_experts)]
self.experts = nn.ModuleList(experts)

def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
let's go with the SwitchTransformersSparseMLP
implementation, it should be more efficient, see #31173
it's not a must but it's good for both standardization and performances in general!
@ArthurZucker I have read the PR you linked and understood the idea of not computing experts that go unused at inference time.
This case is a little different since we have an additional input that determines which experts are used. However, I have changed the code slightly.
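For context, here is a minimal, self-contained sketch of running only the experts that are actually selected when the routing comes from an external per-sample `indices` tensor (e.g. a dataset index) rather than a learned router; the class name, shapes, and usage below are illustrative, not the PR's implementation:

import torch
import torch.nn as nn


class PartMoESketch(nn.Module):
    def __init__(self, hidden_features: int, part_features: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden_features, part_features) for _ in range(num_experts))

    def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch_size, seq_len, hidden_features); indices: (batch_size,) expert id per sample
        batch_size, seq_len, _ = hidden_state.shape
        output = hidden_state.new_zeros(batch_size, seq_len, self.experts[0].out_features)
        # only loop over experts that at least one sample in the batch actually uses
        for expert_idx in indices.unique():
            sample_mask = indices == expert_idx
            output[sample_mask] = self.experts[int(expert_idx)](hidden_state[sample_mask])
        return output


# usage sketch
moe = PartMoESketch(hidden_features=768, part_features=192, num_experts=6)
out = moe(torch.randn(4, 196, 768), indices=torch.tensor([0, 0, 3, 3]))
print(out.shape)  # torch.Size([4, 196, 192])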
hidden_states = attention_output + hidden_states

layer_output = self.layernorm_after(hidden_states)
if self.num_experts == 1:
are there any released models that use MoE?
Yes, they have several models that contain MoE parameters.
Looks alright, but we should be using modular transformers. This will make it easier when we refactor attention or whatnot!
Also, you know the other vision models better than I do, but let's push standardization as much as we can for things that can be translated to an existing pattern!
@ArthurZucker Thanks for the review — I've acknowledged your comments with a thumbs-up reaction and will make the changes. Regarding the backbone question: I don't think there is a strictly necessary reason for splitting this into two independent architectures (Niels did this part originally and designed it that way), but I assume it reflects the intention of both the authors and Niels. The paper targets human pose estimation built on a very strong shared backbone, switching only the last MLP-like (i.e. head) layer for different tasks and datasets, for architectural efficiency. You can see that many models share the same backbone architecture, e.g. https://github.com/facebookresearch/mae. So I think that's why Niels built it this way. (@NielsRogge is there a specific reason for splitting out the backbone architecture?)
It's mainly because ViTPose itself is a framework which could leverage various different backbones, so I made it compatible with the
This reverts commit 2c56a48.
Thanks for iterating! Just 2 things to check, 1 blocking and the other one is not.
In general @qubvel and @NielsRogge I think we have an effort to make to bring standardization to vision models and both of you know a lot more than me about that!
-> Things like patch embedding, classic decoder layers, etc that are re-implemented everywhere could benefit from that. I might be wrong tho! 😉
Anyways, great work everyone! @qubvel feel free to merge once these are addressed and you are happy with the changes!
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
Let's update this!
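To make the quoted comment concrete, a small usage sketch (the dummy image and the exact default crop size are assumptions here):

import numpy as np
from transformers import VitPoseImageProcessor

image_processor = VitPoseImageProcessor()
image = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy image
# two person boxes (x, y, w, h) for this single image -> two crops
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]

inputs = image_processor(images=image, boxes=boxes, return_tensors="pt")
# pixel_values has shape (num_images * num_persons, num_channels, height, width),
# e.g. (2, 3, 256, 192) with the processor's default crop size
print(inputs.pixel_values.shape)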
# add backbone attributes
if not hasattr(self.backbone.config, "hidden_size"):
    raise ValueError("The backbone should have a hidden_size attribute")
if not hasattr(self.backbone.config, "image_size"):
    raise ValueError("The backbone should have an image_size attribute")
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")
config checks should be done in the config, not sure this has a lot of value / can be removed!
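A hedged sketch of what moving that check into the config could look like (the class name, constructor signature, and attribute names here are illustrative, not the actual VitPoseConfig API):

from transformers import PretrainedConfig


class VitPoseConfigSketch(PretrainedConfig):
    def __init__(self, backbone_config=None, **kwargs):
        super().__init__(**kwargs)
        # validate the backbone config once at config-creation time,
        # instead of inside the model's __init__
        if backbone_config is not None:
            for attr in ("hidden_size", "image_size", "patch_size"):
                if not hasattr(backbone_config, attr):
                    raise ValueError(f"The backbone config should have a `{attr}` attribute")
        self.backbone_config = backbone_config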
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")

self.head = VitPoseSimpleDecoder(config) if config.use_simple_decoder else VitPoseClassicDecoder(config)
are both used in the released checkpoints?
Yes, they are!
@qubvel @NielsRogge Hi, can you check the changes and also the checkpoints? Thanks as always for the support.
Hi @SangbumChoi! Thanks for the fixes, I will push the final fixes and merge the PR! Checkpoints can be found here:
initializer_range: float = 0.02,
scale_factor: int = 4,
use_simple_decoder: bool = True,
**kwargs,
num_labels is a default argument of PretrainedConfig
What does this PR do?
This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.
Here's a demo notebook - note that the API might change:
To do: